
I'm working with a 1 GB JSON text file that I'm trying to parse in Java. However, the parser fails when it reaches the character 'ñ' and throws this exception:

Exception Invalid UTF-8 start byte 0x96

I've tried to remove the character using sed and perl, but neither seems to match it, so the file remains unchanged. I'd like to remove the character from the whole file, or replace it with some other character or string, so that parsing works.

SomeKittens
user1261046

2 Answers


Your file is not encoded in UTF-8.

You should find out the actual encoding and pass it to an InputStreamReader when reading the file. Then, if needed, save the content as UTF-8 (for example, using an OutputStreamWriter).
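As a minimal sketch of that read-then-rewrite step: the class name `Transcode` and the method `toUtf8` are made up for illustration, and the source charset is something you have to supply yourself. The copy is streamed through a small buffer, so a 1 GB file does not need to fit in memory.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Transcode {
    /** Reads src using sourceCharset and writes the same text to dst as UTF-8. */
    public static void toUtf8(Path src, Charset sourceCharset, Path dst) throws IOException {
        try (BufferedReader in = new BufferedReader(
                 new InputStreamReader(Files.newInputStream(src), sourceCharset));
             BufferedWriter out = new BufferedWriter(
                 new OutputStreamWriter(Files.newOutputStream(dst), StandardCharsets.UTF_8))) {
            char[] buf = new char[8192];
            int n;
            // Copy in fixed-size chunks so memory use stays constant.
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: Transcode <input> <source-charset> <output>");
            return;
        }
        toUtf8(Paths.get(args[0]), Charset.forName(args[1]), Paths.get(args[2]));
    }
}
```

The rewritten file can then be given to the JSON parser as ordinary UTF-8. Alternatively, if the parser accepts a Reader, you may be able to skip the intermediate file and hand it the InputStreamReader directly.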

If you don't know the encoding, I suggest testing a few probable encodings: see Charsets.
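One way to test candidates, sketched under assumptions: decode a sample of the file with a strict decoder that reports errors instead of silently substituting characters, and see which charsets accept the bytes. The class `CharsetProbe` and its methods are hypothetical names, and the list of candidate charsets is yours to choose.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.ArrayList;
import java.util.List;

public class CharsetProbe {
    /** Returns true if sample decodes under cs with no malformed or unmappable input. */
    public static boolean decodesCleanly(byte[] sample, Charset cs) {
        try {
            cs.newDecoder()
              .onMalformedInput(CodingErrorAction.REPORT)
              .onUnmappableCharacter(CodingErrorAction.REPORT)
              .decode(ByteBuffer.wrap(sample));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Returns the candidate names that this JVM supports and that decode the sample cleanly. */
    public static List<String> plausible(byte[] sample, String... names) {
        List<String> ok = new ArrayList<>();
        for (String name : names) {
            if (Charset.isSupported(name) && decodesCleanly(sample, Charset.forName(name))) {
                ok.add(name);
            }
        }
        return ok;
    }
}
```

A clean decode is necessary but not sufficient: single-byte charsets such as ISO-8859-1 accept every byte value, so you still need to look at the decoded text and check that the 'ñ' actually appears where you expect it. For what it's worth, the byte 0x96 from the exception is 'ñ' in Mac OS Roman, so that charset may be worth including among the candidates (the JDK name is likely `x-MacRoman`, where it is available). If you sample only part of a large file, the sample may end in the middle of a multi-byte sequence, which would make a strict UTF-8 decode fail for that reason alone.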

Denys Séguret

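Yes, the file may not be UTF-8. For information on how to check which encoding it uses, see: Java : How to determine the correct charset encoding of a stream

The best answer there seems to point towards InputStreamReader#getEncoding(). Note, however, that getEncoding() only returns the name of the charset the reader is using; it does not detect the encoding of the underlying bytes.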

David Kroukamp