How to detect illegal UTF-8 byte sequences and replace them in a Java InputStream?

Question

The file in question is not under my control. Most byte sequences are valid UTF-8; it is not ISO-8859-1 (or another encoding). I want to do my best to extract as much information as possible.

The file contains a few illegal byte sequences; those should be replaced with the replacement character.

It's not an easy task; I think it requires some knowledge about the UTF-8 state machine.

Oracle has a wrapper which does what I need:
UTF8ValidationFilter javadoc

Is there something like that available (commercially or as free software)?

Thanks
-stephan

Solution:

final BufferedInputStream in = new BufferedInputStream(istream);
final CharsetDecoder charsetDecoder = StandardCharsets.UTF_8.newDecoder();
charsetDecoder.onMalformedInput(CodingErrorAction.REPLACE);
charsetDecoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
final Reader inputReader = new InputStreamReader(in, charsetDecoder);

I hate this. Content producers should produce valid content, not ask consumers to guess and correct it. That has caused so much trouble in our industry. – irreputable, Sep 27 '10 at 17:16

Henning · Accepted Answer · 2010-09-27T08:12:52.387

12

java.nio.charset.CharsetDecoder does what you need. This class provides charset decoding with user-definable actions on different kinds of errors (see onMalformedInput() and onUnmappableCharacter()).

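CharsetDecoder itself converts a ByteBuffer into a CharBuffer; to apply it to a stream, pass it to an InputStreamReader. If you need an InputStream again, re-encode the decoded characters and pipe them through java.io.PipedOutputStream into a java.io.PipedInputStream, effectively creating a filtered InputStream.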
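As a minimal sketch of the two error actions mentioned above, the following decodes a byte array once with CodingErrorAction.REPLACE and once with CodingErrorAction.REPORT. The class name Utf8Decoding and its method names are hypothetical, chosen only for this illustration; the CharsetDecoder calls are standard JDK API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of the JDK. */
final class Utf8Decoding {

    /** Decodes bytes as UTF-8, substituting U+FFFD for each malformed sequence. */
    static String decodeLeniently(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE, but decode() still declares the exception.
            throw new IllegalStateException(e);
        }
    }

    /** Returns true only if the bytes are well-formed UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // REPORT turns the first malformed sequence into an exception.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] input = {'a', (byte) 0xFF, 'b'}; // 0xFF never occurs in valid UTF-8
        System.out.println(isValidUtf8(input));
        System.out.println(decodeLeniently(input));
    }
}
```

REPLACE suits the "extract as much as possible" goal from the question, while REPORT is the variant to use when the input should be rejected or the error located.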

edited Sep 27 '10 at 08:12

answered Sep 27 '10 at 08:07


@Henning - what if I want to know which line the bad characters are on? – Dejell Dec 08 '13 at 10:24

@Dejel you could split the input into lines and try to detect errors line by line. – Josep Rodríguez López Dec 09 '13 at 12:44
Yes, splitting into lines would be the way to go, but this is usually implemented at the Reader level and not at the InputStream level, so you may have to dig around a bit or write your own. – Henning Dec 09 '13 at 17:10
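The line-by-line idea from these comments could be sketched at the InputStream level roughly as follows. This is only one possible approach, assuming lines end with a '\n' byte (0x0A, which never appears inside a valid multi-byte UTF-8 sequence); the class BadLineFinder and its methods are hypothetical names made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the JDK. */
final class BadLineFinder {

    /** Returns the 1-based numbers of all lines that are not well-formed UTF-8. */
    static List<Integer> findBadLines(InputStream in) throws IOException {
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        List<Integer> badLines = new ArrayList<>();
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int lineNumber = 1;
        int b;
        // Reads byte by byte; pass a BufferedInputStream for large inputs.
        while ((b = in.read()) != -1) {
            if (b == '\n') {
                if (!isValid(strict, line.toByteArray())) {
                    badLines.add(lineNumber);
                }
                line.reset();
                lineNumber++;
            } else {
                line.write(b);
            }
        }
        // Check a final line that has no trailing newline.
        if (line.size() > 0 && !isValid(strict, line.toByteArray())) {
            badLines.add(lineNumber);
        }
        return badLines;
    }

    private static boolean isValid(CharsetDecoder strict, byte[] bytes) {
        try {
            // decode(ByteBuffer) resets the decoder first, so it can be reused per line.
            strict.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {'o', 'k', '\n', (byte) 0xC3, '(', '\n', 'f', 'i', 'n', 'e'};
        System.out.println(findBadLines(new ByteArrayInputStream(data)));
    }
}
```

Splitting on the raw newline byte keeps the check independent of any Reader, at the cost of buffering one line at a time in memory.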

score 0 · Answer 2 · answered Feb 09 '16 at 11:21

The behaviour you want is already the default for InputStreamReader. So there is no need to specify it yourself. This suffices:

final BufferedInputStream in = new BufferedInputStream(istream);
final Reader inputReader = new InputStreamReader(in, StandardCharsets.UTF_8);
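A small check of this claim, as a sketch: the class DefaultReplacementDemo and its readAll method are hypothetical names for this example, and the replacement behaviour shown is what OpenJDK's InputStreamReader does when constructed with a Charset; other runtimes are assumed to match.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo class for illustration. */
final class DefaultReplacementDemo {

    /** Reads the whole stream as UTF-8 using the reader's default error handling. */
    static String readAll(InputStream in) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] input = {'a', (byte) 0xFF, 'b'}; // 0xFF is never valid in UTF-8
        String decoded = readAll(new ByteArrayInputStream(input));
        // Checks whether the invalid byte was turned into U+FFFD.
        System.out.println(decoded.equals("a\uFFFDb"));
    }
}
```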

score 0 · Answer 3 · answered Sep 27 '10 at 15:54

One way would be to read the first few bytes and check for a byte order mark (BOM), if one exists. More information on the BOM: http://en.wikipedia.org/wiki/Byte_order_mark. That page has a table of the BOM bytes. One problem, however, is that UTF-8 does not require a BOM at the start of the file. Another way to solve the problem is pattern recognition: read the input a few bytes (8 bits each) at a time. That is the more complicated solution, though.
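The BOM check described above might look like the following sketch. It only handles the UTF-8 BOM (bytes EF BB BF) and does not detect malformed sequences later in the stream; the class BomSkipper and its method are hypothetical names for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

/** Hypothetical helper for illustration; not part of the JDK. */
final class BomSkipper {

    /** Returns a stream positioned after a leading UTF-8 BOM (EF BB BF), if one is present. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until 3 bytes or EOF.
        while (n < 3) {
            int r = pushback.read(head, n, 3 - n);
            if (r == -1) {
                break;
            }
            n += r;
        }
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            // No BOM: put the bytes back so the caller sees the stream unchanged.
            pushback.unread(head, 0, n);
        }
        return pushback;
    }
}
```

The returned stream can then be handed to an InputStreamReader as in the other answers.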

answered Sep 27 '10 at 15:54

tanjir


The problem was not a BOM; it had already been removed. There is a BOMStripperInputStream floating around, which helps here: http://code.google.com/p/train-graph/source/browse/trunk/src/org/paradise/etrc/data/BOMStripperInputStream.java?r=31 – user85155 Sep 27 '10 at 20:51
