
I am trying to parse a CSV file, ideally using weka.core.converters.CSVLoader. However, the file I have is not a valid UTF-8 file. It is mostly UTF-8, but some of the field values are in different encodings, so there is no encoding in which the whole file is valid, yet I need to parse it anyway. Apart from using Java libraries like Weka, I am mainly working in Scala. I am not even able to read the file using scala.io.Source. For example:

import scala.io.Source

Source.
  fromFile(filename)("UTF-8").
  foreach(print)

throws:

java.nio.charset.MalformedInputException: Input length = 1
    at java.nio.charset.CoderResult.throwException(CoderResult.java:277)
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:337)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:176)
    at java.io.InputStreamReader.read(InputStreamReader.java:184)
    at java.io.BufferedReader.fill(BufferedReader.java:153)
    at java.io.BufferedReader.read(BufferedReader.java:174)
    at scala.io.BufferedSource$$anonfun$iter$1$$anonfun$apply$mcI$sp$1.apply$mcI$sp(BufferedSource.scala:38)
    at scala.io.Codec.wrap(Codec.scala:64)
    at scala.io.BufferedSource$$anonfun$iter$1.apply(BufferedSource.scala:38)
    at scala.io.BufferedSource$$anonfun$iter$1.apply(BufferedSource.scala:38)
    at scala.collection.Iterator$$anon$14.next(Iterator.scala:150)
    at scala.collection.Iterator$$anon$25.hasNext(Iterator.scala:562)
    at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:400)
    at scala.io.Source.hasNext(Source.scala:238)
    at scala.collection.Iterator$class.foreach(Iterator.scala:772)
    at scala.io.Source.foreach(Source.scala:181)

I am perfectly happy to throw all the invalid characters away or replace them with some dummy. I am going to have lots of text like this to process in various ways and may need to pass the data to various third-party libraries. An ideal solution would be some kind of global setting that would cause all the low-level Java libraries to ignore invalid bytes in text, so that I can call third-party libraries on this data without modification.

SOLUTION:

import java.nio.charset.CodingErrorAction
import scala.io.{Codec, Source}

implicit val codec = Codec("UTF-8")
codec.onMalformedInput(CodingErrorAction.REPLACE)
codec.onUnmappableCharacter(CodingErrorAction.REPLACE)

// The implicit codec above is picked up by fromFile, so malformed
// input is now replaced instead of throwing.
Source.
  fromFile(filename).
  foreach(print)

Thanks to +Esailija for pointing me in the right direction. This led me to How to detect illegal UTF-8 byte sequences to replace them in java inputstream? which provides the core Java solution. In Scala I can make this the default behaviour by making the codec implicit. I think I can make it the default behaviour for the entire package by putting the implicit codec definition in the package object.
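
A minimal sketch of that package-object idea (mypackage is a placeholder name):

package object mypackage {
  import java.nio.charset.CodingErrorAction

  // Picked up implicitly by anything in mypackage that needs a Codec,
  // e.g. scala.io.Source.fromFile
  implicit val codec: scala.io.Codec = scala.io.Codec("UTF-8")
    .onMalformedInput(CodingErrorAction.REPLACE)
    .onUnmappableCharacter(CodingErrorAction.REPLACE)
}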

Daniel Mahler
    Somewhere in that mess the `CodingErrorAction` of a `CharsetDecoder` must be set to `IGNORE` or `REPLACE` – Esailija Nov 29 '12 at 11:52
  • +Esailija That is the kind of solution I have in mind. In Python's scikit library, some of the text processing functions take this option as a parameter. I just have not seen anything for setting this in the Java/Scala APIs. – Daniel Mahler Nov 29 '12 at 12:38
  • I have used a hand-made solution in my answer; I don't know anything about Java or Scala either – Esailija Nov 29 '12 at 13:06

8 Answers


This is how I managed to do it with Java:

    import java.io.*;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;

    String result = null;
    try {
        FileInputStream input = new FileInputStream(new File("invalid.txt"));
        // Decoder that silently drops malformed byte sequences
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
        decoder.onMalformedInput(CodingErrorAction.IGNORE);
        InputStreamReader reader = new InputStreamReader(input, decoder);
        BufferedReader bufferedReader = new BufferedReader(reader);
        StringBuilder sb = new StringBuilder();
        String line = bufferedReader.readLine();
        while (line != null) {
            sb.append(line);
            line = bufferedReader.readLine();
        }
        bufferedReader.close();
        result = sb.toString();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }

    System.out.println(result);

The invalid file is created with bytes:

0x68, 0x80, 0x65, 0x6C, 0x6C, 0xC3, 0xB6, 0xFE, 0x20, 0x77, 0xC3, 0xB6, 0x9C, 0x72, 0x6C, 0x64, 0x94

Which is hellö wörld in UTF-8 with 4 invalid bytes mixed in.
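
For reference, a quick Scala sketch that writes those bytes out (assuming the invalid.txt name used in the code above):

import java.io.FileOutputStream

val bytes = Array(0x68, 0x80, 0x65, 0x6C, 0x6C, 0xC3, 0xB6, 0xFE, 0x20,
  0x77, 0xC3, 0xB6, 0x9C, 0x72, 0x6C, 0x64, 0x94).map(_.toByte)
// Write the raw bytes without any charset conversion
val out = new FileOutputStream("invalid.txt")
try out.write(bytes) finally out.close()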

With .REPLACE you see the standard unicode replacement character being used:

//"h�ellö� wö�rld�"

With .IGNORE, you see the invalid bytes ignored:

//"hellö wörld"

Without specifying .onMalformedInput, you get

java.nio.charset.MalformedInputException: Input length = 1
    at java.nio.charset.CoderResult.throwException(Unknown Source)
    at sun.nio.cs.StreamDecoder.implRead(Unknown Source)
    at sun.nio.cs.StreamDecoder.read(Unknown Source)
    at java.io.InputStreamReader.read(Unknown Source)
    at java.io.BufferedReader.fill(Unknown Source)
    at java.io.BufferedReader.readLine(Unknown Source)
    at java.io.BufferedReader.readLine(Unknown Source)
Esailija
  • That is what I had in mind. When I saw your initial comment I searched for CodingErrorAction and CharsetDecoder and found a similar solution in another question. I really want this to be the default behavior in my package, and in Scala I can do that with implicits (not sure if that is possible in Java). Thanks for your help! – Daniel Mahler Nov 29 '12 at 13:52

Scala's Codec has a decoder field which returns a java.nio.charset.CharsetDecoder:

import java.nio.charset.CodingErrorAction
import scala.io.{Codec, Source}

val decoder = Codec.UTF8.decoder.onMalformedInput(CodingErrorAction.IGNORE)
Source.fromFile(filename)(decoder).getLines().toList

(Source.fromFile accepts the configured CharsetDecoder here because scala.io.Codec provides an implicit conversion from CharsetDecoder to Codec.)
maxmc

The solution for Scala's Source (based on @Esailija's answer):

def toSource(inputStream: java.io.InputStream): scala.io.BufferedSource = {
  import java.nio.charset.{Charset, CodingErrorAction}
  // Decoder that drops malformed byte sequences instead of throwing
  val decoder = Charset.forName("UTF-8").newDecoder()
  decoder.onMalformedInput(CodingErrorAction.IGNORE)
  scala.io.Source.fromInputStream(inputStream)(decoder)
}
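
Hypothetical usage (the file name is illustrative):

toSource(new java.io.FileInputStream("data.csv")).getLines().foreach(println)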
raisercostin

The problem with ignoring invalid bytes is deciding when they're valid again. Note that UTF-8 uses variable-length byte encodings for characters, so if a byte is invalid, you need to work out which byte to start reading from to get a valid stream of characters again.

In short, I don't think you'll find a library which can 'correct' the data as it reads. I think a much more productive approach is to try to clean that data up first.

Brian Agnew
  • AFAIK bytes that make up a multibyte character have initial bits which say _I am the first byte of a multibyte character_ or _I am the second byte of a multibyte character_ etc., so it should be possible to throw away bytes until you either get a valid single-byte character or the first byte of a multibyte character. I can read these files in R & in Python. – Daniel Mahler Nov 29 '12 at 11:53
  • That's a good point. However, do you *know* they're correct ? – Brian Agnew Nov 29 '12 at 12:00
  • If you mean the files parsed in R & Python then yes. – Daniel Mahler Nov 29 '12 at 12:15
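
For what it's worth, the leading-bit structure Daniel mentions can be checked directly; a rough Scala sketch (illustrative, not a full validator):

// Classify a raw byte by its UTF-8 role using the leading bits:
// 0xxxxxxx = single-byte (ASCII), 110/1110/11110 = lead byte of a
// 2/3/4-byte sequence, 10xxxxxx = continuation byte.
def utf8Role(b: Byte): String = (b & 0xFF) match {
  case n if n < 0x80           => "ascii"
  case n if (n & 0xC0) == 0x80 => "continuation"
  case n if (n & 0xE0) == 0xC0 => "lead of 2-byte seq"
  case n if (n & 0xF0) == 0xE0 => "lead of 3-byte seq"
  case n if (n & 0xF8) == 0xF0 => "lead of 4-byte seq"
  case _                       => "invalid"
}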

I switch to a different codec if one fails.

To implement the pattern, I got inspiration from this other Stack Overflow question.

I use a default list of codecs and recursively try them in turn. If they all fail, I print out the scary bits:

private val defaultCodecs = List(
  io.Codec("UTF-8"),
  io.Codec("ISO-8859-1")
)

def listLines(file: java.io.File, codecs: Iterable[io.Codec] = defaultCodecs): Iterable[String] = {
  val codec = codecs.head
  val fileHandle = scala.io.Source.fromFile(file)(codec)
  try {
    // Force evaluation here so decoding errors surface inside the try
    fileHandle.getLines().toList
  } catch {
    case ex: Exception =>
      if (codecs.tail.isEmpty) {
        println("Exception:  " + ex)
        println("Skipping file:  " + file.getPath)
        List()
      } else {
        // Retry with the next codec in the list
        listLines(file, codecs.tail)
      }
  } finally {
    fileHandle.close()
  }
}
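
Hypothetical usage (the file name is made up for illustration):

listLines(new java.io.File("mixed.csv")).foreach(println)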

I'm just learning Scala, so the code may not be optimal.

Harry Pehkonen

A simple solution would be to interpret your data stream as ASCII and ignore all non-text characters. However, you would then lose even validly encoded UTF-8 characters. Don't know if that is acceptable for you.
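
A rough Scala sketch of that approach (assuming a filename variable as in the question; every byte at or above 0x80 is dropped, valid UTF-8 included):

import java.nio.charset.CodingErrorAction
import scala.io.{Codec, Source}

// US-ASCII decoder that silently drops any byte it cannot map,
// which includes all multibyte UTF-8 sequences.
implicit val asciiCodec: Codec = Codec("US-ASCII")
  .onMalformedInput(CodingErrorAction.IGNORE)
  .onUnmappableCharacter(CodingErrorAction.IGNORE)

val asciiOnly = Source.fromFile(filename).mkString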

EDIT: If you know in advance which columns are valid UTF-8, you could write your own CSV parser that can be configured with which strategy to use on which column.

mbelow

Use ISO-8859-1 as the encoding; this will just give you byte values packed into a string. This is enough to parse CSV for most encodings. (If you have mixed 8-bit and 16-bit blocks, then you're in trouble; you can still read the lines in ISO-8859-1, but you may not be able to parse the line as a block.)

Once you have the individual fields as separate strings, you can try

new String(oldstring.getBytes("ISO-8859-1"), "UTF-8")

to generate the string with the proper encoding (use the appropriate encoding name per field, if you know it).

Edit: you will have to use java.nio.charset.CharsetDecoder if you want to detect errors. Mapping to UTF-8 this way will just give you the replacement character U+FFFD in your string when there's an error.

val decoder = java.nio.charset.Charset.forName("UTF-8").newDecoder

// By default will throw a MalformedInputException if encoding fails
decoder.decode( java.nio.ByteBuffer.wrap(oldstring.getBytes("ISO-8859-1")) ).toString
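
So a per-field check might look roughly like this (decodeField is a made-up helper name):

def decodeField(field: String): Option[String] =
  try {
    val decoder = java.nio.charset.Charset.forName("UTF-8").newDecoder
    // Succeeds only if the field's bytes are well-formed UTF-8
    Some(decoder.decode(java.nio.ByteBuffer.wrap(field.getBytes("ISO-8859-1"))).toString)
  } catch {
    case _: java.nio.charset.CharacterCodingException => None
  }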
Rex Kerr
  • This is also a viable solution for me. I did not know that the ISO-8859-1 codec will accept arbitrary bytes. I think in Python (or some other language I have used) ISO-8859-1 throws an error on non-ASCII chars. – Daniel Mahler Nov 29 '12 at 15:34

If you're working with Scala, you can handle character encoding issues with:

import scala.io.Codec
implicit val codec: Codec = Codec("ISO-8859-1")
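
With that implicit in scope, reads stop throwing, since ISO-8859-1 maps every byte to some character (file name illustrative):

// The implicit codec above is picked up by fromFile
val lines = scala.io.Source.fromFile("invalid.txt").getLines().toList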
Muddit