
I am trying to parse a CSV file, ideally using weka.core.converters.CSVLoader. However, the file I have is not a valid UTF-8 file. It is mostly UTF-8, but some of the field values are in different encodings, so there is no encoding in which the whole file is valid, yet I need to parse it anyway. Apart from using Java libraries like Weka, I am mainly working in Scala. I am not even able to read the file using scala.io.Source. For example:

import scala.io.Source

Source.
  fromFile(filename)("UTF-8").
  foreach(print)

throws:

java.nio.charset.MalformedInputException: Input length = 1
    at java.nio.charset.CoderResult.throwException(CoderResult.java:277)
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:337)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:176)
    at java.io.InputStreamReader.read(InputStreamReader.java:184)
    at java.io.BufferedReader.fill(BufferedReader.java:153)
    at java.io.BufferedReader.read(BufferedReader.java:174)
    at scala.io.BufferedSource$$anonfun$iter$1$$anonfun$apply$mcI$sp$1.apply$mcI$sp(BufferedSource.scala:38)
    at scala.io.Codec.wrap(Codec.scala:64)
    at scala.io.BufferedSource$$anonfun$iter$1.apply(BufferedSource.scala:38)
    at scala.io.BufferedSource$$anonfun$iter$1.apply(BufferedSource.scala:38)
    at scala.collection.Iterator$$anon$14.next(Iterator.scala:150)
    at scala.collection.Iterator$$anon$25.hasNext(Iterator.scala:562)
    at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:400)
    at scala.io.Source.hasNext(Source.scala:238)
    at scala.collection.Iterator$class.foreach(Iterator.scala:772)
    at scala.io.Source.foreach(Source.scala:181)

I am perfectly happy to throw all the invalid characters away or replace them with some dummy. I am going to have lots of text like this to process in various ways and may need to pass the data to various third-party libraries. An ideal solution would be some kind of global setting that would cause all the low-level Java libraries to ignore invalid bytes in text, so that I can call third-party libraries on this data without modification.

SOLUTION:

import java.nio.charset.CodingErrorAction
import scala.io.{Codec, Source}

implicit val codec = Codec("UTF-8")
codec.onMalformedInput(CodingErrorAction.REPLACE)
codec.onUnmappableCharacter(CodingErrorAction.REPLACE)

// The implicit codec above is picked up by fromFile, so malformed
// input is now replaced instead of throwing.
Source.
  fromFile(filename).
  foreach(print)

Thanks to +Esailija for pointing me in the right direction. This led me to How to detect illegal UTF-8 byte sequences to replace them in java inputstream? which provides the core Java solution. In Scala I can make this the default behaviour by making the codec implicit. I think I can make it the default behaviour for the entire package by putting the implicit codec definition in the package object.
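
A minimal sketch of that package-object idea (mypackage is a placeholder name):

package object mypackage {
  import java.nio.charset.CodingErrorAction

  // Picked up implicitly by anything in mypackage that needs a Codec,
  // e.g. scala.io.Source.fromFile
  implicit val codec: scala.io.Codec = scala.io.Codec("UTF-8")
    .onMalformedInput(CodingErrorAction.REPLACE)
    .onUnmappableCharacter(CodingErrorAction.REPLACE)
}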

Daniel Mahler
    Somewhere in that mess the `CodingErrorAction` of a `CharsetDecoder` must be set to `IGNORE` or `REPLACE` – Esailija Nov 29 '12 at 11:52
  • +Esailija That is the kind of solution I have in mind. In Python's scikit library, some of the text processing functions take this option as a parameter. I just have not seen anything for setting this in the Java/Scala APIs. – Daniel Mahler Nov 29 '12 at 12:38
  • I have used a hand-made solution in my answer; I don't know anything about Java or Scala either – Esailija Nov 29 '12 at 13:06

8 Answers


This is how I managed to do it with Java:

    import java.io.*;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;

    String result = null;
    try {
        FileInputStream input = new FileInputStream(new File("invalid.txt"));
        // Decoder that silently drops malformed byte sequences
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
        decoder.onMalformedInput(CodingErrorAction.IGNORE);
        InputStreamReader reader = new InputStreamReader(input, decoder);
        BufferedReader bufferedReader = new BufferedReader(reader);
        StringBuilder sb = new StringBuilder();
        String line = bufferedReader.readLine();
        while (line != null) {
            sb.append(line);
            line = bufferedReader.readLine();
        }
        bufferedReader.close();
        result = sb.toString();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }

    System.out.println(result);

The invalid file is created with bytes:

0x68, 0x80, 0x65, 0x6C, 0x6C, 0xC3, 0xB6, 0xFE, 0x20, 0x77, 0xC3, 0xB6, 0x9C, 0x72, 0x6C, 0x64, 0x94

Which is hellö wörld in UTF-8 with 4 invalid bytes mixed in.
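
For reference, a quick Scala sketch that writes those bytes out (assuming the invalid.txt name used in the code above):

import java.io.FileOutputStream

val bytes = Array(0x68, 0x80, 0x65, 0x6C, 0x6C, 0xC3, 0xB6, 0xFE, 0x20,
  0x77, 0xC3, 0xB6, 0x9C, 0x72, 0x6C, 0x64, 0x94).map(_.toByte)
// Write the raw bytes without any charset conversion
val out = new FileOutputStream("invalid.txt")
try out.write(bytes) finally out.close()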

With .REPLACE you see the standard unicode replacement character being used:

//"h�ellö� wö�rld�"

With .IGNORE, you see the invalid bytes ignored:

//"hellö wörld"

Without specifying .onMalformedInput, you get

java.nio.charset.MalformedInputException: Input length = 1
    at java.nio.charset.CoderResult.throwException(Unknown Source)
    at sun.nio.cs.StreamDecoder.implRead(Unknown Source)
    at sun.nio.cs.StreamDecoder.read(Unknown Source)
    at java.io.InputStreamReader.read(Unknown Source)
    at java.io.BufferedReader.fill(Unknown Source)
    at java.io.BufferedReader.readLine(Unknown Source)
    at java.io.BufferedReader.readLine(Unknown Source)
Esailija
  • That is what I had in mind. When I saw your initial comment I searched for CodingErrorAction and CharsetDecoder and found a similar solution in another question. I really want this to be the default behavior in my package, and in Scala I can do that with implicits (not sure if that is possible in Java). Thanks for your help! – Daniel Mahler Nov 29 '12 at 13:52

Scala's Codec has a decoder field which returns a java.nio.charset.CharsetDecoder:

import java.nio.charset.CodingErrorAction
import scala.io.{Codec, Source}

val decoder = Codec.UTF8.decoder.onMalformedInput(CodingErrorAction.IGNORE)
Source.fromFile(filename)(decoder).getLines().toList

(Source.fromFile accepts the configured CharsetDecoder here because scala.io.Codec provides an implicit conversion from CharsetDecoder to Codec.)
maxmc

The solution for Scala's Source (based on @Esailija's answer):

def toSource(inputStream: java.io.InputStream): scala.io.BufferedSource = {
  import java.nio.charset.{Charset, CodingErrorAction}
  // Decoder that drops malformed byte sequences instead of throwing
  val decoder = Charset.forName("UTF-8").newDecoder()
  decoder.onMalformedInput(CodingErrorAction.IGNORE)
  scala.io.Source.fromInputStream(inputStream)(decoder)
}
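
Hypothetical usage (the file name is illustrative):

toSource(new java.io.FileInputStream("data.csv")).getLines().foreach(println)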
raisercostin

The problem with ignoring invalid bytes is deciding when they're valid again. Note that UTF-8 uses variable-length byte encodings for characters, so if a byte is invalid, you need to work out which byte to start reading from to get a valid stream of characters again.

In short, I don't think you'll find a library which can 'correct' the data as it reads. I think a much more productive approach is to try to clean that data up first.

Brian Agnew
  • AFAIK bytes that make up a multibyte character have initial bits which say _I am the first byte of a multibyte character_ or _I am the second byte of a multibyte character_ etc., so it should be possible to throw away bytes until you either get a valid single-byte character or the first byte of a multibyte character. I can read these files in R & in Python. – Daniel Mahler Nov 29 '12 at 11:53
  • That's a good point. However, do you *know* they're correct ? – Brian Agnew Nov 29 '12 at 12:00
  • If you mean the files parsed in R & Python then yes. – Daniel Mahler Nov 29 '12 at 12:15
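
For what it's worth, the leading-bit structure Daniel mentions can be checked directly; a rough Scala sketch (illustrative, not a full validator):

// Classify a raw byte by its UTF-8 role using the leading bits:
// 0xxxxxxx = single-byte (ASCII), 110/1110/11110 = lead byte of a
// 2/3/4-byte sequence, 10xxxxxx = continuation byte.
def utf8Role(b: Byte): String = (b & 0xFF) match {
  case n if n < 0x80           => "ascii"
  case n if (n & 0xC0) == 0x80 => "continuation"
  case n if (n & 0xE0) == 0xC0 => "lead of 2-byte seq"
  case n if (n & 0xF0) == 0xE0 => "lead of 3-byte seq"
  case n if (n & 0xF8) == 0xF0 => "lead of 4-byte seq"
  case _                       => "invalid"
}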

I switch to a different codec if one fails.

To implement the pattern, I got inspiration from this other Stack Overflow question.

I use a default list of codecs and recursively try them in turn. If they all fail, I print out the scary bits:

private val defaultCodecs = List(
  io.Codec("UTF-8"),
  io.Codec("ISO-8859-1")
)

def listLines(file: java.io.File, codecs: Iterable[io.Codec] = defaultCodecs): Iterable[String] = {
  val codec = codecs.head
  val fileHandle = scala.io.Source.fromFile(file)(codec)
  try {
    // Force evaluation here so decoding errors surface inside the try
    fileHandle.getLines().toList
  } catch {
    case ex: Exception =>
      if (codecs.tail.isEmpty) {
        println("Exception:  " + ex)
        println("Skipping file:  " + file.getPath)
        List()
      } else {
        // Retry with the next codec in the list
        listLines(file, codecs.tail)
      }
  } finally {
    fileHandle.close()
  }
}
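
Hypothetical usage (the file name is made up for illustration):

listLines(new java.io.File("mixed.csv")).foreach(println)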

I'm just learning Scala, so the code may not be optimal.

Harry Pehkonen

A simple solution would be to interpret your data stream as ASCII and ignore all non-text characters. However, you would then lose even validly encoded UTF-8 characters. Don't know if that is acceptable for you.
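
A rough Scala sketch of that approach (assuming a filename variable as in the question; every byte at or above 0x80 is dropped, valid UTF-8 included):

import java.nio.charset.CodingErrorAction
import scala.io.{Codec, Source}

// US-ASCII decoder that silently drops any byte it cannot map,
// which includes all multibyte UTF-8 sequences.
implicit val asciiCodec: Codec = Codec("US-ASCII")
  .onMalformedInput(CodingErrorAction.IGNORE)
  .onUnmappableCharacter(CodingErrorAction.IGNORE)

val asciiOnly = Source.fromFile(filename).mkString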

EDIT: If you know in advance which columns are valid UTF-8, you could write your own CSV parser that can be configured with which strategy to use on which column.

mbelow

Use ISO-8859-1 as the encoding; this will just give you byte values packed into a string. This is enough to parse CSV for most encodings. (If you have mixed 8-bit and 16-bit blocks, then you're in trouble; you can still read the lines in ISO-8859-1, but you may not be able to parse the line as a block.)

Once you have the individual fields as separate strings, you can try

new String(oldstring.getBytes("ISO-8859-1"), "UTF-8")

to generate the string with the proper encoding (use the appropriate encoding name per field, if you know it).

Edit: you will have to use java.nio.charset.CharsetDecoder if you want to detect errors. Mapping to UTF-8 this way will just give you the replacement character U+FFFD in your string when there's an error.

val decoder = java.nio.charset.Charset.forName("UTF-8").newDecoder

// By default will throw a MalformedInputException if encoding fails
decoder.decode( java.nio.ByteBuffer.wrap(oldstring.getBytes("ISO-8859-1")) ).toString
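
So a per-field check might look roughly like this (decodeField is a made-up helper name):

def decodeField(field: String): Option[String] =
  try {
    val decoder = java.nio.charset.Charset.forName("UTF-8").newDecoder
    // Succeeds only if the field's bytes are well-formed UTF-8
    Some(decoder.decode(java.nio.ByteBuffer.wrap(field.getBytes("ISO-8859-1"))).toString)
  } catch {
    case _: java.nio.charset.CharacterCodingException => None
  }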
Rex Kerr
  • This is also a viable solution for me. I did not know that the ISO-8859-1 codec will accept arbitrary bytes. I think in Python (or some other language I have used) ISO-8859-1 throws an error on non-ASCII chars. – Daniel Mahler Nov 29 '12 at 15:34

If you're working with Scala, you can handle character encoding issues with:

import scala.io.Codec
implicit val codec: Codec = Codec("ISO-8859-1")
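
With that implicit in scope, reads stop throwing, since ISO-8859-1 maps every byte to some character (file name illustrative):

// The implicit codec above is picked up by fromFile
val lines = scala.io.Source.fromFile("invalid.txt").getLines().toList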
Muddit