I am reading a text file in my program that contains the Unicode BOM character (\ufeff, code point 65279) in places. This causes several problems in later parsing.

Right now I am detecting and filtering these characters myself, but I would like to know whether the Java standard library or Guava offers a cleaner way to do this.

missingfaktor
  • 1
    In _places_? The BOM should be the first bytes of a file; otherwise it isn't a BOM. – Boris the Spider Apr 13 '13 at 08:43
  • 2
    Assuming that the BOM is at the start of the file, [this](http://code.google.com/p/guava-libraries/issues/detail?id=345&colspec=ID%20Type%20Status%20Milestone%20Summary) bug report on the Guava website explains that Guava doesn't handle a BOM, and [this](http://stackoverflow.com/questions/9736999/how-to-remove-bom-from-an-xml-file-in-java) post gives an idea of how to skip it in plain Java. – Boris the Spider Apr 13 '13 at 08:51
  • @bmorris591, yes, in the beginning. Thanks. If you post your 2nd comment as an answer, I will mark it accepted. – missingfaktor Apr 13 '13 at 09:29

1 Answer

There is no built-in way of dealing with a (UTF-8) BOM in Java or, indeed, in Guava.

There is currently a bug report on the Guava website about dealing with a BOM in Guava IO.

There are several SO posts (here and here) on how to detect/skip the BOM while reading a file in plain Java.

Your BOM (\ufeff) seems to be UTF-16, which, according to the same Guava report, should be dealt with automatically by Java. This SO post seems to suggest the same.
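
The detect-and-skip approach those posts describe can be sketched roughly as follows. This is a minimal illustration, not code taken from the linked posts: the class name `BomSkipper` and both method names are made up for this example, and it assumes the input has already been decoded into characters, so the BOM shows up as the single char U+FEFF.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper for illustration; not part of the JDK or Guava.
public class BomSkipper {
    private static final char BOM = '\uFEFF';

    /** Wraps a Reader and consumes one leading BOM character, if present. */
    public static Reader skipLeadingBom(Reader in) throws IOException {
        PushbackReader pushback = new PushbackReader(in, 1);
        int first = pushback.read();
        // Put the character back unless it is the BOM or end of stream.
        if (first != -1 && first != BOM) {
            pushback.unread(first);
        }
        return pushback;
    }

    /** Removes every BOM character from already-decoded text. */
    public static String stripAllBoms(String text) {
        return text.replace("\uFEFF", "");
    }

    public static void main(String[] args) throws IOException {
        Reader source = new StringReader("\uFEFFhello");
        try (BufferedReader reader = new BufferedReader(skipLeadingBom(source))) {
            System.out.println(reader.readLine()); // prints "hello"
        }
    }
}
```

For a real file, the `StringReader` would be replaced by an `InputStreamReader` opened with the file's actual charset. `stripAllBoms` covers the case where the character turns up somewhere other than the very start, for example after several BOM-prefixed files have been concatenated.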

Boris the Spider