This is too long for a comment, so I'm making it an answer, although it's not really an answer...
First, you seem confused about what "Unicode" means. ASCII is a subset of Unicode, for example: every valid ASCII character is also a valid Unicode character.
Then you're probably confused about the distinction between a character set and its actual encoding. For example, ASCII is purely a 7-bit encoding: it defines 128 codepoints (its first commercial use was actually for a seven-bit teleprinter: http://en.wikipedia.org/wiki/ASCII). Although it is a 7-bit encoding, ASCII is nowadays typically stored in 8-bit bytes, with the leftmost/highest bit always cleared.
Unicode defines far more than 65536 codepoints. There are several ways to encode Unicode codepoints, UTF-8 being one of them.
One particularly useful feature of UTF-8 is that any valid ASCII text file (where every byte has its leftmost/highest bit clear) is also a valid UTF-8 / Unicode file.
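For instance, here's a minimal Java sketch illustrating that point (the class name and the sample strings are just made-up examples): a pure-ASCII string encodes to exactly the same bytes in ASCII and in UTF-8, while a non-ASCII character such as 'é' becomes a multi-byte sequence.

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiVsUtf8 {
    public static void main(String[] args) {
        // A pure-ASCII string produces exactly the same bytes in ASCII and in UTF-8.
        byte[] ascii = "lea 123".getBytes(StandardCharsets.US_ASCII);
        byte[] utf8  = "lea 123".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(ascii, utf8));   // prints: true

        // A non-ASCII character like 'é' (U+00E9) becomes two bytes (0xC3 0xA9) in UTF-8.
        for (byte b : "léa 123".getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("%02x ", b & 0xFF);         // mask to print the byte unsigned
        }
        System.out.println();   // prints: 6c c3 a9 61 20 31 32 33
    }
}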
What are you after? Finding every character that is not an ASCII character?
Anyway, this is actually complicated to do correctly in Java. Because Java was conceived before Unicode 3.1, when there were fewer than 65536 codepoints, the Java char primitive is a completely broken abstraction of a Unicode codepoint (Unicode has had more than 65536 codepoints for more than ten years). Then came Java 1.5 and the new codepoint-related methods: it's a bit better, but you still can't easily iterate over codepoints. The codePointAt(...) method is confusing in that it returns a codepoint, yet its index argument counts in Java chars (which, by the way, has been a Sun bug/RFE for many moons). A sketch of doing it correctly follows below.
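To make that concrete, here's a hedged sketch of iterating over real codepoints with the Java 1.5+ methods; the reportNonAscii name and the sample string are purely illustrative:

public class CodePointScan {

    // Report every codepoint outside the 7-bit ASCII range.
    static void reportNonAscii(String s) {
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);            // a real codepoint, possibly > 0xFFFF
            if (cp > 0x7F) {
                System.out.printf("non-ASCII codepoint U+%04X at char index %d%n", cp, i);
            }
            i += Character.charCount(cp);         // advance by 1 or 2 chars, not always 1
        }
    }

    public static void main(String[] args) {
        reportNonAscii("léa 123");                // reports U+00E9 at char index 1
    }
}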
Understanding that alone is amazingly difficult if you're not familiar both with this monstrous Java SNAFU and with ASCII/Unicode/UTF-8.
In addition to that, there's probably a more fundamental issue at play here: the XML file you're parsing should correctly declare the encoding it uses and should actually be encoded that way; only then can Java decode it correctly. Is your XML file correct? Are you decoding it from Java using the correct charset? Something like a hexdump of a problematic part of your XML file could help a lot here.
Here's an example of how to proceed on a file called "problematic.txt" on a Un*x system (this works fine on Linux and OS X, for example):
$ file problematic.txt
problematic.txt: UTF-8 Unicode text
$ hexdump -C problematic.txt
00000000 6c c3 a9 61 20 31 32 33 0a |l..a 123.|
00000009
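And here's a minimal sketch of how you might decode such a file from Java with an explicit charset instead of the platform default; the file name and the choice of UTF-8 are just assumptions, so use whatever encoding your XML actually declares:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class DecodeWithCharset {
    public static void main(String[] args) throws IOException {
        // Decode explicitly as UTF-8 instead of relying on the platform default charset.
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream("problematic.txt"), "UTF-8"));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        } finally {
            in.close();
        }
    }
}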
If you gave us more info about the problematic file, people here could help you more.
Meanwhile:
http://en.wikipedia.org/wiki/ASCII
http://en.wikipedia.org/wiki/UTF-8
http://en.wikipedia.org/wiki/Unicode