
I am parsing an XML file which has UTF-8 encoding.

<?xml version="1.0" encoding="UTF-8"?>

Our business application consists of a set of components developed by different teams, and they do not all use the same library for parsing XML. My component uses JAXB, another uses SAX, and so forth. When the XML file contains special characters such as "ä", "ë" or "é" (characters with umlauts or accents), JAXB parses it properly, but the other components (sub-apps) cannot parse them and throw an exception.

Due to business constraints I cannot change the code of the other components, so I have to add a restriction/validation in my application to make sure that the XML (data-load) file does not contain any such characters.

What is the best approach to make sure the file does not contain the above-mentioned (or similar) characters, so that I can throw an exception (or report an error) right there, before I start parsing the XML file using JAXB?

Brad Larson
deej
  • sounds as simple as your question - check the file, if it contains invalid characters... if you cannot rely on the header information, then you have to encode the file by yourself and see if it crashes... you can read a file by using a certain encoding, see http://stackoverflow.com/questions/3043710/java-inputstream-encoding-charset – Martin Frank Jul 28 '14 at 11:26
  • The behavior you describe is in fact impossible *unless* either your XML states to be `encoding="UTF-8"` and in reality is not, or the other component you feed it to ignores the XML declaration and tries to parse it as a legacy encoding (very unlikely). I would bet on the first situation: You create XML with the wrong encoding. Correct the declaration to match your file encoding, or correct your file encoding to be UTF-8. **To tell what is the case here**, we'd need a hexadecimal snippet from an affected file. – Tomalak Jul 28 '14 at 11:27
  • See another related post here http://stackoverflow.com/questions/499010/java-how-to-determine-the-correct-charset-encoding-of-a-stream . – Keshava Jul 28 '14 at 11:28
  • It is possible because these files are coming from various customers and they are generating XML files in format we expect but might just be putting XML headers without respecting what data they are putting in. We are not sure what tools and technologies those customers might be using. – deej Jul 28 '14 at 11:31
  • @MartinFrank I am not sure what all characters can create problem so it is best to make sure that file does not have special characters. I am just thinking out loud is there a way to validate file against all non-ascii characters? – deej Jul 28 '14 at 11:33
  • @Tomalak I concur with the "impossible", but I can't follow the reasoning that follows. OP says, *he* can parse using JAXB, so the file should be OK and UTF-8. And, using JAXB, it's rather difficult to create an XML file with a header saying encoding="UTF-8" and erroneously encoded. – laune Jul 28 '14 at 11:38
  • "We are not sure what tools and technologies those customers might be using." This is putting it *very* politely and diplomatically. – laune Jul 28 '14 at 11:46
  • @laune our customers are vendors and they do not use tools provided by us to generate this XML file which we are using as data-feed for our application. They possibly are using their own home-grown tool to provide us the file in the expected format (XML tags and encoding), but they may fail for one reason or another if they are not doing it right. – deej Jul 28 '14 at 11:50
  • @laune I figured that JAXB might have some sort of document encoding detection that ignores the XML declaration and allows parsing documents with a wrong encoding hint. As I said, as long as we don't see a hex dump of an affected file it's impossible to tell whether it is all-right or not. – Tomalak Jul 28 '14 at 12:29
  • @Tomalak Only the XML header. You can parse any ISO 8859-x using any other ISO 8859-y and it'll succeed and produce gibberish. Even with a hex dump there's no telling which encoding it is. For example: `c3 a4 c3 b6 c3 bc 0a` You can decode this as "äöü" or *aöü*, or several other possibilities. What is it really? – laune Jul 28 '14 at 12:43
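
To make the last comment concrete, here is a minimal sketch that decodes the quoted bytes `c3 a4 c3 b6 c3 bc 0a` with two different charsets. The class and method names are made up for illustration; only `String(byte[], Charset)` and `StandardCharsets` come from the JDK.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: the same bytes, decoded two ways.
final class DecodeDemo {
    // The byte sequence quoted in the comment above.
    static final byte[] BYTES = {
        (byte) 0xC3, (byte) 0xA4, (byte) 0xC3, (byte) 0xB6,
        (byte) 0xC3, (byte) 0xBC, 0x0A
    };

    // Treats each 0xC3 0x?? pair as one two-byte UTF-8 sequence.
    static String asUtf8() {
        return new String(BYTES, StandardCharsets.UTF_8);
    }

    // Treats every byte as one character, so the result is twice as long.
    static String asLatin1() {
        return new String(BYTES, StandardCharsets.ISO_8859_1);
    }
}
```

Both calls succeed without any error; only the resulting text differs, which is why the bytes alone do not reveal the intended encoding.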

3 Answers


If your customer sends you an XML file with a header whose encoding does not match the file contents, you might as well give up trying to do anything meaningful with that file. Are they really sending data where the header does not match the actual encoding? That's not XML, then. And you ought to charge them more ;-)

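Simply read the file through a FileInputStream, byte by byte. If it contains a byte with the high bit set (a negative value when stored in a Java byte, or a value above 127 as returned by read()), refuse to process it.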

You can keep encoding settings like UTF-8 or ISO 8859-1, because they all have US-ASCII as a proper subset.
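
A minimal sketch of that check, assuming the input is available as an InputStream. The class and method names are invented for illustration. Note that `InputStream.read()` returns 0 to 255 rather than a signed byte, so the "negative byte" test becomes a comparison against 0x7F.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: rejects any input containing a non-ASCII byte.
final class AsciiOnlyCheck {
    /** Returns true if every byte in the stream is in the US-ASCII range 0x00-0x7F. */
    static boolean isAsciiOnly(InputStream in) throws IOException {
        // Buffering avoids one system call per byte; the stream is closed on exit.
        try (InputStream buffered = new BufferedInputStream(in)) {
            int b;
            while ((b = buffered.read()) != -1) {
                if (b > 0x7F) {
                    return false; // high bit set: not US-ASCII
                }
            }
            return true;
        }
    }
}
```

It could be called as `AsciiOnlyCheck.isAsciiOnly(new FileInputStream("input.xml"))` before the file is handed to JAXB, throwing your own exception when it returns false.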

laune

Yes, my answer would be the same as the one laune gives...

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

static boolean readInput() {
    // ISO-8859-1 maps every byte to exactly one char, so each value read
    // below is the raw byte value (0-255) regardless of the real encoding.
    try (Reader in = new BufferedReader(new InputStreamReader(
            new FileInputStream("test.txt"), StandardCharsets.ISO_8859_1))) {
        int ch;
        while ((ch = in.read()) > -1) {
            // One possible range check, following laune's answer: anything
            // above 0x7F is not US-ASCII. For stricter UTF-8 range checks see
            // the table at http://en.wikipedia.org/wiki/UTF-8
            if (ch > 0x7F) {
                return false;
            }
        }
        return true;
    } catch (IOException e) {
        e.printStackTrace();
        return false;
    }
}

I'm just adding a code snippet...
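
If the goal is to check that the bytes are well-formed UTF-8 (rather than only ASCII), the range checks from the Wikipedia table need not be written by hand. The following is a sketch under that assumption, not part of the original answer: it uses the JDK's `CharsetDecoder` with `CodingErrorAction.REPORT`, and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: strict UTF-8 well-formedness check.
final class Utf8Check {
    /** Returns true if the bytes decode as UTF-8 without any malformed sequence. */
    static boolean isValidUtf8(byte[] data) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

For a file this could be called as `Utf8Check.isValidUtf8(Files.readAllBytes(Paths.get("test.txt")))`. A file that passes is well-formed UTF-8, but as the comments on the question point out, that alone does not prove which encoding the sender intended.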

Martin Frank

You should be able to wrap the XML input in a java.io.Reader for which you specify the actual encoding, and then process it normally. For an InputStream, the parser uses the encoding declared in the XML; when a Reader is used, the encoding of the Reader is used instead.

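import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import javax.xml.bind.Unmarshaller;

// jc is assumed to be an existing JAXBContext that knows the Address class
Unmarshaller unmarshaller = jc.createUnmarshaller();
// Name the file's actual encoding here; "UTF-16" is only an example
try (Reader reader = new InputStreamReader(new FileInputStream("input.xml"), "UTF-16")) {
    Address address = (Address) unmarshaller.unmarshal(reader);
}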
bdoughan