I need to read a text file line by line and apply several CharsetDecoders to each line, in order. Specifically, I first try to decode the line as UTF-8, and fall back to a one-byte charset if the UTF-8 CharsetDecoder throws MalformedInputException.
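The decode-then-fall-back step described above might look roughly like the sketch below. It is only an illustration: the class name `FallbackDecoder` and the method `decode` are hypothetical, it assumes Java 7+ for `StandardCharsets`, and it catches `CharacterCodingException` (the superclass of `MalformedInputException`) so that unmappable input also triggers the fallback.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class FallbackDecoder {

    // Hypothetical helper: try strict UTF-8 first, then a one-byte fallback charset.
    static String decode(byte[] line, Charset fallback) {
        // REPORT makes the decoder throw instead of silently substituting characters.
        CharsetDecoder utf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return utf8.decode(ByteBuffer.wrap(line)).toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: decode the same bytes with the fallback charset.
            return new String(line, fallback);
        }
    }

    public static void main(String[] args) {
        byte[] validUtf8 = {(byte) 0xC3, (byte) 0xA9}; // U+00E9 encoded as UTF-8
        byte[] notUtf8 = {(byte) 0xE9};                // U+00E9 encoded as ISO-8859-1
        System.out.println(decode(validUtf8, StandardCharsets.ISO_8859_1));
        System.out.println(decode(notUtf8, StandardCharsets.ISO_8859_1));
    }
}
```

A fresh decoder is created per call to keep the sketch simple; a single decoder could presumably be reused, since the convenience method `decode(ByteBuffer)` resets it before decoding.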

However, if I use an InputStreamReader with the default or a specified charset, the readLine function silently replaces every byte it considers invalid for that charset with '?'.

I ended up writing my own line-reading function, which reads from the stream byte by byte, looks for line terminators, and assembles the lines. But this approach appears terribly slow.
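For reference, a byte-level line reader along these lines could be sketched as follows. The asker's actual code is not shown, so this is an assumption about its shape; the class name `ByteLineReader` is hypothetical. The one deliberate choice is wrapping the source in a `BufferedInputStream`, because single-byte `read()` calls on an unbuffered stream are a common cause of slowness (whether that was the cause here is not known).

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

final class ByteLineReader {
    private final InputStream in;
    private boolean skipLf = false; // true when the previous line ended with CR

    ByteLineReader(InputStream source) {
        // Buffering keeps each single-byte read() cheap.
        this.in = new BufferedInputStream(source);
    }

    // Returns the next line's raw bytes without the terminator, or null at end of stream.
    byte[] readLine() throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        boolean sawAny = false;
        int b;
        while ((b = in.read()) != -1) {
            if (skipLf) {
                skipLf = false;
                if (b == '\n') {
                    continue; // second half of a CR LF pair
                }
            }
            sawAny = true;
            if (b == '\n') {
                return buf.toByteArray();
            }
            if (b == '\r') {
                skipLf = true;
                return buf.toByteArray();
            }
            buf.write(b);
        }
        return sawAny ? buf.toByteArray() : null;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {0x61, 0x0D, 0x0A, (byte) 0xE9, 0x0A, 0x63};
        ByteLineReader reader = new ByteLineReader(new ByteArrayInputStream(data));
        byte[] line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line.length + " byte(s)");
        }
    }
}
```

The sketch treats LF, CR, and CR LF as terminators, which is only safe for encodings where the bytes 0x0A and 0x0D never occur inside a multi-byte character (true for UTF-8 and one-byte charsets, not for UTF-16).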

Is there any way to make Java read lines without touching the bytes?

UPDATE: I've found that there are charsets in which all 256 byte values are valid, two of them being line terminators. So it is possible to read a raw byte stream line by line. Examples of such charsets are:

IBM00858 IBM437 IBM775 IBM850 IBM852 IBM855 IBM860 IBM861 IBM862 IBM863 IBM865 IBM866 ISO-8859-1 ISO-8859-13 ISO-8859-15 ISO-8859-2 ISO-8859-4 ISO-8859-5 ISO-8859-9 KOI8-R KOI8-U windows-1256
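A minimal sketch of this approach, using ISO-8859-1 from the list above: each line is read as a String through a `BufferedReader`, then converted back to its original bytes with the same charset, since ISO-8859-1 maps byte values 0x00 to 0xFF one-to-one onto U+0000 to U+00FF. The class and method names (`RawLineReader`, `readRawLines`) are hypothetical, and `StandardCharsets` assumes Java 7+.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class RawLineReader {

    // Hypothetical helper: returns each line's original bytes, terminators stripped.
    static List<byte[]> readRawLines(InputStream in) throws IOException {
        List<byte[]> lines = new ArrayList<>();
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.ISO_8859_1));
        String line;
        while ((line = reader.readLine()) != null) {
            // Encoding with the same charset restores the bytes exactly as read.
            lines.add(line.getBytes(StandardCharsets.ISO_8859_1));
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xC3, (byte) 0xA9, 0x0A, (byte) 0xE9, (byte) 0xFF, 0x0D, 0x0A, (byte) 0x80};
        for (byte[] raw : readRawLines(new ByteArrayInputStream(data))) {
            System.out.println(raw.length + " byte(s)");
        }
    }
}
```

Each recovered byte array can then be passed to whatever chain of CharsetDecoders is needed. Note that `readLine` only splits on LF, CR, and CR LF, so this works for UTF-8 and one-byte encodings but not for encodings such as UTF-16, where those byte values can appear inside a character.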

The question is now closed.

day7
  • "*appears really slow*" - do you mean "seems like it will be..." or "I measured it and it is..."? – Lawrence Dol Jul 06 '11 at 04:13
  • I think it's fair to assume that reading a stream byte by byte to determine encoding is measurably slower than reading it with InputStreamReader. – Paul Wheeler Jul 06 '11 at 04:21
  • @Software Monkey I ran both the readLine version and the byte-by-byte one, and the second feels notably slower to me. – day7 Jul 06 '11 at 04:38
  • This question is a duplicate of this one: [Java : How to determine the correct charset encoding of a stream](http://stackoverflow.com/questions/499010/java-how-to-determine-the-correct-charset-encoding-of-a-stream) – Paul Wheeler Jul 06 '11 at 04:20
  • It really isn't. I'm not asking how to determine the charset encoding, as I already have a way to do that. I need a way to read lines without messing with the bytes. – day7 Jul 06 '11 at 04:33
  • Once you have detected the charset, simply create a new InputStreamReader using the overload that takes a charset, which you can obtain using Charset.forName. – Paul Wheeler Jul 06 '11 at 05:39

1 Answer

You can't use a reader class and expect it not to decode the underlying byte stream. If you have a file where each line is encoded in a different charset (?), then you'd be better off devising a method for detecting the underlying character encoding. Perhaps you can use an encoding detector such as juniversalchardet.

  • If there exists a one-byte encoding for which InputStreamReader considers all bytes valid, and hence doesn't replace them with '?', I could use it for my purpose. – day7 Jul 06 '11 at 04:51
  • And I cannot use external libraries, just core java.* classes. – day7 Jul 06 '11 at 04:56
  • Fortunately, such charsets exist! I'll update the question. And is it possible here to mark the question as closed? – day7 Jul 06 '11 at 06:15