The file in question is not under my control. Most of its byte sequences are valid UTF-8; it is not ISO-8859-1 (or another encoding). I want to do my best to extract as much information as possible.

The file contains a few illegal byte sequences. Those should be replaced with the replacement character (U+FFFD).

It's not an easy task; I think it requires some knowledge of the UTF-8 state machine.

Oracle has a wrapper which does what I need:
UTF8ValidationFilter javadoc

Is there something like that available (commercially or as free software)?

Thanks
-stephan

Solution:

final BufferedInputStream in = new BufferedInputStream(istream);
final CharsetDecoder charsetDecoder = StandardCharsets.UTF_8.newDecoder();
charsetDecoder.onMalformedInput(CodingErrorAction.REPLACE);
charsetDecoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
final Reader inputReader = new InputStreamReader(in, charsetDecoder);
Thomas S.
user85155
    I hate this. Content producers should produce valid content, not ask consumers to guess and correct. That has been causing so much trouble in our industry. – irreputable Sep 27 '10 at 17:16

3 Answers

java.nio.charset.CharsetDecoder does what you need. This class provides charset decoding with user-definable actions on different kinds of errors (see onMalformedInput() and onUnmappableCharacter()).

CharsetDecoder itself works on buffers (ByteBuffer in, CharBuffer out) rather than on streams. To apply it to a stream, pass the configured decoder to an InputStreamReader, which then acts as a filtering Reader over the underlying InputStream.
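
As a minimal sketch of the error actions mentioned above, the following decodes a byte array directly with a CharsetDecoder configured to REPLACE. The class name LenientUtf8 and the method decode are made up for this illustration; only the java.nio.charset calls are real JDK API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the JDK.
class LenientUtf8 {
    /** Decodes bytes as UTF-8, substituting U+FFFD for each malformed sequence. */
    static String decode(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not expected when both actions are REPLACE; rethrown just in case.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // 0xFF can never occur in valid UTF-8.
        byte[] input = {'a', (byte) 0xFF, 'b'};
        System.out.println(decode(input).equals("a\uFFFDb")); // prints "true"
    }
}
```

This variant holds the whole input in memory, so for large files the InputStreamReader approach is probably the better fit.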

Henning

The behaviour you want is already the default for InputStreamReader. So there is no need to specify it yourself. This suffices:

final BufferedInputStream in = new BufferedInputStream(istream);
final Reader inputReader = new InputStreamReader(in, StandardCharsets.UTF_8);
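
A small way to check this claim on your own runtime, as a sketch: feed a stream containing one invalid byte through a plain InputStreamReader and look at what comes out. The class name DefaultReplaceDemo and the method readAll are invented for this example. The replacement behaviour is what the JDK implementation does in practice; it is worth verifying on the JVM you actually deploy on.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the JDK.
class DefaultReplaceDemo {
    /** Reads the whole stream as UTF-8 using InputStreamReader's default error handling. */
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // 0x80 is a continuation byte with no lead byte, so it is malformed here.
        byte[] input = {'o', 'k', (byte) 0x80};
        String text = readAll(new ByteArrayInputStream(input));
        System.out.println(text.equals("ok\uFFFD"));
    }
}
```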
Joe23

One way would be to read the first few bytes and check for a byte order mark (BOM), if one exists. More information on the BOM: http://en.wikipedia.org/wiki/Byte_order_mark. At that URL you will find a table of the BOM bytes. One problem, however, is that UTF-8 does not require a BOM at the start of the file. Another way to solve the problem is pattern recognition (reading a few bytes, 8 bits at a time). Anyway, this is the complicated solution.
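
The BOM check described above could look roughly like this. It is only a sketch: the class BomSkipper and the method skipUtf8Bom are hypothetical names, and it handles only the UTF-8 BOM (bytes EF BB BF), not the UTF-16 or UTF-32 variants from the table.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper class, not part of the JDK.
class BomSkipper {
    /** Returns a stream positioned after a leading UTF-8 BOM, if one is present. */
    static InputStream skipUtf8Bom(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(3); // remember the start; at most 3 bytes are read before reset
        boolean bom = in.read() == 0xEF && in.read() == 0xBB && in.read() == 0xBF;
        if (!bom) {
            in.reset(); // no BOM: rewind so no content is lost
        }
        return in;
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'x'};
        InputStream in = skipUtf8Bom(new ByteArrayInputStream(withBom));
        System.out.println((char) in.read()); // prints "x"
    }
}
```

Note that this only detects a marker at the start of the file; it does nothing about malformed sequences further in.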

tanjir
    The problem was not a BOM; it had already been removed. There is a BOMStripperInputStream floating around, which helps here: http://code.google.com/p/train-graph/source/browse/trunk/src/org/paradise/etrc/data/BOMStripperInputStream.java?r=31 – user85155 Sep 27 '10 at 20:51