
I have a file watcher that is grabbing content from a growing file encoded in UTF-16LE. The first chunk of data written to it includes the BOM -- I was using this to distinguish the encoding from UTF-8 (which MOST of my incoming files are encoded in). When I catch the BOM I re-encode to UTF-8 so my parser doesn't freak out. The problem is that since it's a growing file, not every chunk of data has the BOM in it.

Here's my question -- without prepending the BOM bytes to each set of data (because I don't have control over the source), can I just look for the null bytes (\000) that are inherent in UTF-16 and use that as my identifier instead of the BOM? Will this cause me headaches down the road?
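To be concrete, the kind of null-byte sniff I have in mind would look roughly like this (a sketch only -- the sample size and threshold are guesses on my part, and it will obviously misfire on data full of code points above the ASCII range):

  // Sketch: ASCII-range characters encoded as UTF-16LE come out as
  // (byte, 0x00) pairs, so a large share of null bytes in a small sample
  // is a hint (not proof) that the data is UTF-16.
  static boolean looksLikeUtf16(byte[] sample) {
    if (sample.length < 2) {
      return false;
    }
    int limit = Math.min(sample.length, 64);  // only sniff the first few bytes
    int nulls = 0;
    for (int i = 0; i < limit; i++) {
      if (sample[i] == 0x00) {
        nulls++;
      }
    }
    // guess UTF-16 if roughly a third or more of the sampled bytes are null
    return nulls * 3 >= limit;
  }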

My architecture involves a Ruby web application logging the received data to a temporary file, which my parser (written in Java) then picks up.

Right now my identification/re-encoding code looks like this:

  // guess encoding if utf-16 then
  // convert to UTF-8 first
  try {
    FileInputStream fis = new FileInputStream(args[args.length-1]);
    byte[] contents = new byte[fis.available()];
    fis.read(contents, 0, contents.length);

    if ( (contents[0] == (byte)0xFF) && (contents[1] == (byte)0xFE) ) {
      String asString = new String(contents, "UTF-16");
      byte[] newBytes = asString.getBytes("UTF8");
      FileOutputStream fos = new FileOutputStream(args[args.length-1]);
      fos.write(newBytes);
      fos.close();
    }

    fis.close();
  } catch (Exception e) {
    e.printStackTrace();
  }

UPDATE

I want to support characters like the euro sign, em-dashes, and other such non-ASCII characters. I modified the above code to look like this, and it seems to pass all my tests for those characters:

  // guess encoding if utf-16 then
  // convert to UTF-8 first
  try {
    FileInputStream fis = new FileInputStream(args[args.length-1]);
    byte[] contents = new byte[fis.available()];
    fis.read(contents, 0, contents.length);
    byte[] real = null;

    int found = 0;

    // if found a BOM then skip out of here... we just need to convert it
    if ( (contents[0] == (byte)0xFF) && (contents[1] == (byte)0xFE) ) {
      found = 3;
      real = contents;

    // no BOM detected but still could be UTF-16
    } else {

      // count null bytes in the first few bytes as a hint that this is UTF-16
      for (int cnt = 0; cnt < Math.min(10, contents.length); cnt++) {
        if (contents[cnt] == (byte)0x00) { found++; }
      }

      // tack a little-endian BOM onto the front and copy the data in after it
      real = new byte[contents.length + 2];
      real[0] = (byte)0xFF;
      real[1] = (byte)0xFE;
      for (int ib = 2; ib < real.length; ib++) {
        real[ib] = contents[ib - 2];
      }

    }

    if(found >= 2) {
      String asString = new String(real, "UTF-16");
      byte[] newBytes = asString.getBytes("UTF8");
      FileOutputStream fos = new FileOutputStream(args[args.length-1]);
      fos.write(newBytes);
      fos.close();
    }

    fis.close();
  } catch (Exception e) {
    e.printStackTrace();
  }

What do you all think?

eyberg
  • I don't understand the problem (you refer to in the first paragraph). Of course not every fragment will have the BOM - but surely the beginning of the file. So for each file, remember whether you have seen the BOM, and if so, process it as UTF-16. – Martin v. Löwis Aug 28 '09 at 00:38
  • Surely you don't have mixed encodings within a single file? – 1800 INFORMATION Aug 28 '09 at 00:40
  • I don't have control of the file that is growing -- that is why I can not simply rewind the file. I don't have mixed encodings within the same file -- but I DO have mixed encodings going across a network to a parser that accepts mixed encodings. – eyberg Aug 28 '09 at 00:55
  • @feydr, you may not have control of the *file* but you do have control of your *file handle*. Seeking to the start of your file with your file handle will not affect anyone else. – paxdiablo Aug 28 '09 at 01:13
  • @Pax, that is under the assumption that I am expecting each source to write to the same file -- this is definitely not the case at all -- individual bits of data drop into a queue from many sources -- I'm thinking about using 'human intelligence' knowing that a certain source (operating system/program) will produce a certain encoding and using that information to determine which encoding I want to use... – eyberg Aug 28 '09 at 01:19
  • I have no control of the file nor its file handle -- the data is taken and shot across HTTP -- it winds up as a new file still encoded the same way but sometimes without the BOM – eyberg Aug 28 '09 at 01:21
  • Then, no, you cannot reliably detect the encoding, unless you're certain the characters will be limited to a subset. Then you just analyze the data block to discount illegal encoding for the given subset. But even that won't be 100% reliable. – paxdiablo Aug 28 '09 at 05:45
  • @feydr: what I think is that what you have implemented is "fragile". A more reliable solution is to capture and use the character encoding information from the HTTP POST requests. – Stephen C Aug 29 '09 at 06:11

3 Answers


In general, you cannot identify the character encoding of a data stream with 100% accuracy. The best you can do is try to decode using a limited set of expected encodings, and then apply some heuristics to the decoded result to see if it "looks like" text in the expected language. (But any heuristic will give false positives and false negatives for certain data streams.) Alternatively, put a human in the loop to decide which decoding makes the most sense.
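For example, a strict decoding pass over a short list of candidate charsets looks something like the sketch below. The candidate list and the "first clean decode wins" policy are assumptions you would tune for your data; remember that many byte sequences decode cleanly under more than one charset, so this is still a guess.

  import java.nio.ByteBuffer;
  import java.nio.charset.CharacterCodingException;
  import java.nio.charset.Charset;
  import java.nio.charset.CodingErrorAction;

  // Try each expected charset with a strict decoder; return the first one
  // that decodes the whole buffer without a malformed-input error.
  static String guessCharset(byte[] data, String... candidates) {
    for (String name : candidates) {
      try {
        Charset.forName(name)
               .newDecoder()
               .onMalformedInput(CodingErrorAction.REPORT)
               .onUnmappableCharacter(CodingErrorAction.REPORT)
               .decode(ByteBuffer.wrap(data));
        return name;                  // decoded cleanly
      } catch (CharacterCodingException e) {
        // not this one, try the next candidate
      }
    }
    return null;                      // nothing in the candidate list matched
  }

Usage would be along the lines of guessCharset(contents, "UTF-8", "UTF-16LE").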

A better solution is to redesign your protocol so that whatever is supplying the data also has to supply the encoding scheme used for the data. (And if you cannot, blame whoever is responsible for designing / implementing a system that cannot give you an encoding scheme!)

EDIT: from your comments on the question, the data files are being delivered via HTTP. In this case, you should arrange for your HTTP server to capture the "Content-Type" header of the POST requests delivering the data, extract the character set / encoding from the header, and save it in a way / place that your file parser can deal with.
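On the parsing side, pulling the charset parameter out of a raw Content-Type value is straightforward. A minimal sketch (the helper name and the UTF-8 fallback are assumptions for illustration; where you get the header string from depends on how your Ruby app hands it over):

  // Sketch: extract the charset parameter from a Content-Type value such
  // as "text/plain; charset=UTF-16LE".  Defaulting to UTF-8 when no
  // charset is present is an assumption, not something from the question.
  static String charsetFrom(String contentType) {
    if (contentType == null) {
      return "UTF-8";
    }
    for (String part : contentType.split(";")) {
      String p = part.trim();
      if (p.regionMatches(true, 0, "charset=", 0, 8)) {
        return p.substring(8).replace("\"", "").trim();
      }
    }
    return "UTF-8";
  }

The Ruby application would then just need to write that value somewhere next to the temporary file (a side file, a database column, even the file name) so the parser never has to guess.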

Stephen C

This will cause you headaches down the road, no doubt about it. You can check for alternating zero bytes for the simplistic cases (ASCII-only content in UTF-16, either byte order), but the minute you start getting a stream of characters above the 0x7F code point, that method becomes useless.
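To illustrate what that check looks like (and where it falls over), here's a rough sketch -- the sample size and the "all nulls on one side" rule are simplifying assumptions:

  // Sketch: for ASCII-only text, UTF-16LE puts the null byte of each pair
  // at odd offsets and UTF-16BE puts it at even offsets.  Anything above
  // 0x7F breaks the pattern, which is exactly the limitation above.
  static String sniffUtf16ByteOrder(byte[] data) {
    int oddNulls = 0, evenNulls = 0;
    int limit = Math.min(data.length, 100);
    for (int i = 0; i < limit; i++) {
      if (data[i] == 0x00) {
        if (i % 2 == 0) { evenNulls++; } else { oddNulls++; }
      }
    }
    if (oddNulls > 0 && evenNulls == 0) { return "UTF-16LE"; }
    if (evenNulls > 0 && oddNulls == 0) { return "UTF-16BE"; }
    return null;  // no clean alternating pattern -- probably not ASCII-only UTF-16
  }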

If you have the file handle, the best bet is to save the current file pointer, seek to the start, read the BOM then seek back to the original position.
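Something along these lines, assuming the watcher holds a seekable handle such as a RandomAccessFile (the question's code uses a FileInputStream, so that handle is an assumption):

  // Sketch: remember the current position, peek at the first two bytes,
  // then seek back.  Moving our own file pointer doesn't affect the writer.
  static boolean startsWithUtf16LeBom(RandomAccessFile raf) throws IOException {
    long pos = raf.getFilePointer();  // remember where we were
    try {
      raf.seek(0);                    // jump to the start of the file
      byte[] bom = new byte[2];
      int n = raf.read(bom);
      return n == 2 && bom[0] == (byte) 0xFF && bom[1] == (byte) 0xFE;
    } finally {
      raf.seek(pos);                  // restore the original position
    }
  }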

Either that or remember the BOM somehow.

Relying on the data contents is a bad idea unless you're absolutely certain the character range will be restricted for all inputs.

paxdiablo
  • Relying on the BOM is a worse idea unless you're absolutely certain that the file will have one. – dan04 Aug 14 '10 at 21:54
  • "The first bit of data written to it has the BOM available" was in the question so I _was_ absolutely certain :-) – paxdiablo Aug 15 '10 at 00:48

This question lists a few options for character-set detection that don't appear to require a BOM.

My project is currently using jCharDet, but I might need to look at some of the other options listed there, as jCharDet is not 100% reliable.

jwaddell