How to remove invalid characters from XML

Invalid XML Characters

The XML specification supports a very specific character set. Characters that fall outside of the specified ranges result in a parser error whenever the XML string is parsed.

Section 2.2 of the XML specification defines the permitted character set. The following characters are supported by XML:

Char   ::=   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
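The production above translates fairly directly into a range check on Unicode code points. The following is a minimal sketch of that check; the class name XmlChars and the method name isValidXml10 are hypothetical and chosen only for illustration.

```java
final class XmlChars {
    private XmlChars() {}

    /**
     * Returns true if the given Unicode code point matches the XML 1.0
     * Char production:
     * #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
     */
    static boolean isValidXml10(int codePoint) {
        return codePoint == 0x9
                || codePoint == 0xA
                || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }
}
```

Working with int code points rather than char values keeps the supplementary range (#x10000 and above) in scope, since a single Java char cannot hold those values.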

Any character outside the set above is invalid and typically causes a parser error when the XML string is parsed, for example:

  • INVALID_CHARACTER_ERR: An invalid or illegal XML character is specified; or
  • An invalid XML character was found in the element content of the document; or
  • Etcetera…
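To see this kind of failure in practice, the sketch below feeds a string to the JDK's built-in DOM parser and reports whether it parsed. The class name InvalidCharDemo and the helper parses are hypothetical; the exact error message depends on the parser implementation, so the code only checks whether a SAXException was raised.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class InvalidCharDemo {
    private InvalidCharDemo() {}

    /** Returns true if the string parses as XML, false if the parser rejects it. */
    static boolean parses(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Raised for malformed input, including characters outside the Char production.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            // Not expected for an in-memory string with a default configuration.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("<doc>fine</doc>"));
        // U+0001 is a control character that is not allowed in XML 1.0.
        System.out.println(parses("<doc>bad\u0001char</doc>"));
    }
}
```

The default error handler may also print the parser's own diagnostic to standard error before the exception is thrown.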

To address these errors, make sure the input XML string is compliant by removing any characters that fall outside the ranges in the specification. These characters are typically non-printable or unreadable artefacts of uncontrolled input from outside sources. This is to be expected whenever the input XML originates from other systems or from user input.

Note: Always insist on well-formed XML from other systems. This reduces the maintenance overhead and the occurrence of parser failures.

Example

The following example strips any characters that fall outside of the XML version 1.0 standard and returns a string with only allowable characters. This approach uses a regular expression to identify invalid characters and then replace them with an empty string. The regular expression to identify the invalid characters uses the valid character set and then negates it.

/**
  * This method ensures that the output String has only valid XML unicode
  * characters as specified by the XML 1.0 standard. For reference, please
  * see <a href="http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char">the
  * standard</a>.
  *
  * @param in The String whose non-valid characters we want to remove.
  * @return The in String, stripped of non-valid characters and trimmed of
  *         leading and trailing whitespace.
  */
public static String stripInvalidXMLCharacters(String in) {
     // XML 1.0
     // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
     String xml10pattern =
             "[^" +
             "\u0009\r\n" +
             "\u0020-\uD7FF" +
             "\uE000-\uFFFD" +
             // Surrogate pairs for U+10000 and U+10FFFF; java.util.regex
             // treats each pair as a single supplementary code point.
             "\ud800\udc00-\udbff\udfff" +
             "]";
     // Remove every character outside the valid set, then trim whitespace.
     return in.replaceAll(xml10pattern, "").trim();
}
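As an alternative to the regular expression, the same filtering can be sketched as a loop over code points. This is a different technique from the method above, not a drop-in equivalent: it returns null for null input and does not trim whitespace. The class name XmlSanitizer and its method names are hypothetical.

```java
final class XmlSanitizer {
    private XmlSanitizer() {}

    /** Range check for the XML 1.0 Char production. */
    private static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the input with every code point outside the XML 1.0 set removed. */
    static String stripInvalidXml10(String in) {
        if (in == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            // codePointAt combines a valid surrogate pair into one code point;
            // an unpaired surrogate is returned as-is and fails the range check.
            int cp = in.codePointAt(i);
            i += Character.charCount(cp);
            if (isValidXml10(cp)) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }
}
```

Walking code points makes the handling of surrogate pairs explicit: a well-formed pair is kept as one supplementary character, while a lone surrogate is dropped.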
