How to use the correct XML character encoding

The XML specification allows for the use of various encodings. Every XML processor must support UTF-8 and UTF-16, and these are the encodings typically expected; however, other encodings exist and remain in use. Correctly encoding XML allows for portable applications.

Other character encodings are encountered when content comes from systems that use regional or legacy encodings rather than UTF-8, for example ISO-8859-1 for Western European languages or EUC-JP for Japanese. The XML specification describes character encoding under section 4.3.3: Character Encoding in Entities.

Note: Assuming that every document will be encoded in UTF-8 will only result in software that is non-portable. Always conduct an exercise to select the character encoding applicable to the requirements.

Defining the encoding for an XML document is approached from three different aspects, each described under its own heading:

Document Content

The document content determines the target encoding. In other words, the language and character set requirements determine the target character encoding. For example, German and Japanese text use characters that are very different from those of standard English.
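As a rough way to test whether a candidate encoding can represent the content at all, the following sketch uses the standard `CharsetEncoder.canEncode` method. The class name `EncodingCheck`, the method name `canRepresent` and the sample strings are illustrative assumptions, not part of the original article. Non-ASCII characters are written as Unicode escapes so the source file itself compiles regardless of its own encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: checks whether a charset can represent a piece of text.
class EncodingCheck {

    /** Returns true if every character of the text can be encoded in the given charset. */
    static boolean canRepresent(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String german = "Gr\u00fc\u00dfe";        // German sample containing u-umlaut and sharp s
        String japanese = "\u65e5\u672c\u8a9e";   // Japanese sample, three kanji

        System.out.println(canRepresent(german, StandardCharsets.US_ASCII));     // false
        System.out.println(canRepresent(german, StandardCharsets.ISO_8859_1));   // true
        System.out.println(canRepresent(japanese, StandardCharsets.ISO_8859_1)); // false
        System.out.println(canRepresent(japanese, StandardCharsets.UTF_8));      // true
    }
}
```

A check like this only shows that the characters are representable; it says nothing about which encoding the receiving system expects.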

XML Document Declaration

The XML document declaration defines the target encoding. The “encoding” attribute provides the mechanism to specify the encoding for the benefit of the XML parser. Refer to the entry on well-formed XML document structure for more information on the use of the XML document declaration.

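The XML parser uses this declaration to process the entity character encodings. Note that the version attribute is required and must precede the encoding attribute. Example declarations:

  • <?xml version="1.0" encoding="UTF-8"?>
  • <?xml version="1.0" encoding="EUC-JP"?>
  • <?xml version="1.0" encoding="ISO-8859-1"?>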
  • etc…
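To see what the parser actually read from the declaration, the DOM Level 3 method `Document.getXmlEncoding()` can be queried after parsing. The sketch below is a minimal illustration; the class name `DeclaredEncoding`, the method name `declaredEncoding` and the sample document are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper: reports the encoding named in a document's XML declaration.
class DeclaredEncoding {

    /** Parses the bytes and returns the declared encoding (null if none was specified). */
    static String declaredEncoding(byte[] xmlBytes) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes));
        return doc.getXmlEncoding();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<greeting>Gr\u00fc\u00dfe</greeting>";
        // The bytes are produced with the same encoding the declaration names.
        byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(declaredEncoding(bytes));
    }
}
```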

Storage Encoding

XML is stored as strings encoded to bytes. The character encoding determines how strings are transformed into their byte representation and vice versa, and therefore which characters can be stored. If done incorrectly, it can lead to data corruption or loss.
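The effect of the storage encoding can be illustrated with plain `String.getBytes` and `new String(bytes, charset)` calls. This is a small sketch; the class name `StorageEncoding` and its method names are assumptions made for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: how the chosen encoding changes the stored bytes.
class StorageEncoding {

    /** Number of bytes needed to store the text in the given charset. */
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    /** Writes the text with one charset and reads the bytes back with another. */
    static String reinterpret(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // Latin small letter e with acute

        // The same character occupies a different number of bytes per encoding.
        System.out.println(byteLength(eAcute, StandardCharsets.ISO_8859_1)); // 1
        System.out.println(byteLength(eAcute, StandardCharsets.UTF_8));      // 2

        // Corruption: UTF-8 bytes read back as ISO-8859-1 become two wrong characters.
        String corrupted = reinterpret(eAcute, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(corrupted.equals("\u00c3\u00a9")); // true

        // Loss: characters the charset cannot represent are replaced with '?'.
        String lost = reinterpret("\u65e5\u672c\u8a9e", StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1);
        System.out.println(lost); // ???
    }
}
```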

XML parsers will generally accept all valid encodings; however, when the XML content, the document encoding declaration and the underlying string or storage encoding do not match, the parser will throw an exception similar to the following:

  • Invalid byte x of y-byte UTF-8 sequence.
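A mismatch of this kind can be reproduced by declaring UTF-8 while storing the bytes as ISO-8859-1. The sketch below assumes the JDK's default DOM parser; the class name `EncodingMismatch` and the method name `parses` are illustrative, and the exact exception type and message wording may vary between parser implementations.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

// Hypothetical demo: declared encoding and stored bytes disagree.
class EncodingMismatch {

    /** Returns true if the bytes parse cleanly, false if the parser rejects them. */
    static boolean parses(byte[] xmlBytes) throws ParserConfigurationException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            builder.parse(new ByteArrayInputStream(xmlBytes));
            return true;
        } catch (IOException | SAXException e) {
            // Depending on the parser, the failure surfaces as one of these two types.
            System.err.println("Parse failed: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<greeting>Gr\u00fc\u00dfe</greeting>";

        // Declaration and bytes agree.
        System.out.println(parses(xml.getBytes(StandardCharsets.UTF_8)));
        // Declaration says UTF-8, but the bytes are ISO-8859-1.
        System.out.println(parses(xml.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```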

Example

The following method takes an input string of an undetermined character encoding and forces the XML parser to use the specified encoding. In short the method does three things:

  • Strips the XML declaration and replaces it with a declaration that specifies the required encoding.
  • Converts the input string to a byte array using the specified encoding.
  • Defines the encoding on the input source for the document parser.

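import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.regex.Matcher;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public static Document parseDocument(String xmlString, boolean validating, boolean namespaceAware, String encoding)
        throws SAXException, ParserConfigurationException, IOException {
    // Replace any existing XML declaration with one that names the required encoding.
    // The "\\s" after "xml" leaves other processing instructions such as <?xml-stylesheet ...?> untouched.
    String declaration = "<?xml version=\"1.0\" encoding=\"" + encoding + "\"?>";
    xmlString = xmlString.replaceFirst("<\\?xml\\s.+?\\?>", Matcher.quoteReplacement(declaration));

    // Convert the string to bytes using the same encoding; the stream is closed automatically.
    try (InputStream inStream = new ByteArrayInputStream(xmlString.getBytes(encoding))) {
        // Tell the parser explicitly which encoding the byte stream uses.
        InputSource is = new InputSource(inStream);
        is.setEncoding(encoding);

        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(validating);
        factory.setNamespaceAware(namespaceAware);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(is);
    }
}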
