
XML Parsing Failing Due to Encoding Not Being UTF-8

Anyone who works with XML a lot has probably encountered a parser error message such as "Invalid byte 2 of 2-byte UTF-8 sequence". The main reason for this error is that the XML parser is reading the input as UTF-8, but the input contains bytes that are not valid UTF-8.

The exception stack trace looks something like this:

[Fatal Error] :1:25: Invalid byte 1 of 1-byte UTF-8 sequence.
Exception in thread "main" org.xml.sax.SAXParseException: Invalid byte 1 of 1-byte UTF-8 sequence.
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)

This problem typically occurs when the String version of the XML is converted into a byte array. Calling getBytes() with no arguments on the XML String converts it into an array of bytes using the platform default encoding, while the parser assumes UTF-8 when the document carries no encoding declaration. The platform default encoding depends on the operating system and the locale for which it is set up. Linux and Solaris commonly default to UTF-8, whereas Windows with a US English locale defaults to Cp1252. Note that from Java 18 onward the JVM default charset is UTF-8 on all platforms, so the mismatch mainly affects older Java versions.
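To make the byte-level difference concrete, here is a minimal sketch that encodes the character Ñ with two charsets and prints the resulting bytes. The class name EncodingDemo and the helper hex are illustrative names, not part of the original code, and the sketch assumes the windows-1252 charset is available in your JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Render a byte array as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u00D1"; // the character Ñ, written as an escape to avoid source-encoding issues

        // UTF-8 needs two bytes for this character.
        System.out.println("UTF-8:   " + hex(text.getBytes(StandardCharsets.UTF_8)));
        // Cp1252 (windows-1252) needs only one byte.
        System.out.println("Cp1252:  " + hex(text.getBytes(Charset.forName("windows-1252"))));
        // No argument: whatever the platform default happens to be.
        System.out.println("default: " + hex(text.getBytes()));
    }
}
```

The single Cp1252 byte is not a valid UTF-8 sequence on its own, which is why a parser that expects UTF-8 rejects it.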

Sample XML that is not UTF-8 Encoded

<customer>
<name>MUÑIZ'</name>
<country>USA</country>
</customer>

Look closely at the name tag value in the XML above: it contains a non-ASCII character (Ñ), which is encoded differently in UTF-8 and in Cp1252.

Code That Can Fail Depending on the OS

public Document parse(String xml) throws ParsingFailedException
{
    try
    {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();

        // getBytes() with no argument uses the platform default encoding,
        // which is not guaranteed to be UTF-8
        Document doc = builder.parse(new ByteArrayInputStream(xml.getBytes()));
        log.info("XML parsing OK");

        return doc;
    }
    catch (Exception e)
    {
        log.error("Parser Error:" + e.getMessage());
        throw new ParsingFailedException("Failed to parse XML : Document not well formed", e);
    }
}

This can be the case with both SAX and DOM parsers. To solve it, make sure the XML is encoded as UTF-8 before handing it over to the parser.
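The failure can be reproduced on any platform by encoding the sample XML explicitly instead of relying on the default. The following is a rough sketch, not the original author's code: the class Utf8ParseCheck and the method parseName are hypothetical names, and it assumes the JDK's built-in parser and an available windows-1252 charset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

class Utf8ParseCheck {
    // Sample document with no encoding declaration, so the parser assumes UTF-8.
    static final String XML =
        "<customer><name>MU\u00D1IZ</name><country>USA</country></customer>";

    // Returns the text of <name> if the bytes parse, or null if the parser rejects them.
    static String parseName(byte[] bytes) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            Document doc = builder.parse(new ByteArrayInputStream(bytes));
            return doc.getElementsByTagName("name").item(0).getTextContent();
        } catch (SAXException | IOException e) {
            // Malformed byte sequences surface here.
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        // Encoded as UTF-8: the parser accepts the document.
        System.out.println(parseName(XML.getBytes(StandardCharsets.UTF_8)));
        // Encoded as Cp1252: the lone byte for Ñ is not valid UTF-8, so parsing fails.
        System.out.println(parseName(XML.getBytes(Charset.forName("windows-1252"))));
    }
}
```

The parser may also print a "[Fatal Error]" line to standard error for the second call, since the default error handler is left in place.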

Correct Code to Avoid the Problem

public Document parse(String xml) throws ParsingFailedException
{
    try
    {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();

        // encode the XML as UTF-8 explicitly
        // (requires: import java.nio.charset.StandardCharsets;)
        ByteArrayInputStream encXML =
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));

        Document doc = builder.parse(encXML);
        log.info("XML parsing OK");

        return doc;
    }
    catch (Exception e)
    {
        log.error("Parser Error:" + e.getMessage());
        throw new ParsingFailedException("Failed to parse XML : Document not well formed", e);
    }
}
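When the XML is already held in a String, a different technique from the one above is to skip the byte conversion entirely and give the parser a character stream. This is an alternative sketch rather than the article's method; the class name ReaderParse is illustrative.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ReaderParse {
    // Parse XML from a String through a Reader, so no byte encoding is involved.
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<customer><name>MU\u00D1IZ</name></customer>");
        System.out.println(doc.getElementsByTagName("name").item(0).getTextContent());
    }
}
```

Because the parser receives characters rather than bytes, the platform default encoding never comes into play.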

To find the default encoding that the JVM picked up from your OS, you can use:

System.getProperty("file.encoding");
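As a small runnable sketch, the property can be printed next to Charset.defaultCharset(), which reports the charset that getBytes() with no argument actually uses. The class name DefaultEncoding and the method describe are illustrative names.

```java
import java.nio.charset.Charset;

class DefaultEncoding {
    // Build a one-line summary of the JVM's default encoding settings.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
            + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

The output differs by machine and Java version, so no expected value is shown here.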

 

