
I have a web application that allows users to upload their content for processing. The processing engine expects UTF-8 (and I'm composing XML from multiple users' files), so I need to ensure that I can properly decode the uploaded files.

Since I'd be surprised if any of my users even knew their files were encoded, I have very little hope that they'd be able to correctly specify the encoding (decoder) to use. And so my application is left with the task of detecting the encoding before decoding.

This seems like such a universal problem, I'm surprised not to find either a framework capability or general recipe for the solution. Can it be I'm not searching with meaningful search terms?

I've implemented BOM-aware detection (http://en.wikipedia.org/wiki/Byte_order_mark) but I'm not sure how often files will be uploaded w/o a BOM to indicate encoding, and this isn't useful for most non-UTF files.
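
For reference, my BOM check is roughly along these lines (a simplified sketch, not my exact code):

```csharp
using System.Text;

// Simplified sketch of a BOM check. Order matters: test the 4-byte
// UTF-32 BOMs before the 2-byte UTF-16 BOMs they begin with.
static Encoding DetectBomEncoding(byte[] b)
{
    if (b.Length >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)
        return Encoding.UTF32;                                   // UTF-32 LE
    if (b.Length >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF)
        return new UTF32Encoding(bigEndian: true, byteOrderMark: true);  // UTF-32 BE
    if (b.Length >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return Encoding.UTF8;                                    // UTF-8
    if (b.Length >= 2 && b[0] == 0xFF && b[1] == 0xFE)
        return Encoding.Unicode;                                 // UTF-16 LE
    if (b.Length >= 2 && b[0] == 0xFE && b[1] == 0xFF)
        return Encoding.BigEndianUnicode;                        // UTF-16 BE
    return null;                                                 // no BOM: caller must guess
}
```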

My questions boil down to:

  1. Is BOM-aware detection sufficient for the vast majority of files?
  2. In the case where BOM-detection fails, is it possible to try different decoders and determine if they are "valid"? (My attempts indicate the answer is "no.")
  3. Under what circumstances will a "valid" file fail with the C# encoder/decoder framework?
  4. Is there a repository anywhere that has a multitude of files with various encodings to use for testing?
  5. While I'm specifically asking about C#/.NET, I'd like to know the answer for Java, Python and other languages for the next time I have to do this.

So far I've found:

  • A "valid" UTF-16 file with Ctrl-S characters has caused encoding to UTF-8 to throw an exception (Illegal character?) (That was an XML encoding exception.)
  • Decoding a valid UTF-16 file with UTF-8 succeeds but gives text with null characters. Huh?
  • Currently, I only expect UTF-8, UTF-16 and probably ISO-8859-1 files, but I want the solution to be extensible if possible.
  • My existing set of input files isn't nearly broad enough to uncover all the problems that will occur with live files.
  • Although the files I'm trying to decode are "text", I think they are often created with methods that leave garbage characters in them. Hence "valid" files may not be "pure". Oh joy.

Thanks.

NVRAM
  • What makes you think UTF-8 and UTF-16 are compatible? One stores data in single byte blocks, the other in 2-byte blocks... – Matthew Scharley Feb 22 '10 at 21:01
  • BOM is mostly used on Microsoft OSes; Unices prefer encoding without a BOM. – Vlad Feb 22 '10 at 21:04
  • Whether or not a `Ctrl-S` character is allowed doesn't depend on the encoding format. Both UTF-8 and UTF-16 are able to encode `Ctrl-S`; it's just that the software consuming the resulting UTF-8 may not expect that character. – Vlad Feb 22 '10 at 21:06
  • If you decode a UTF-16 file _as if_ it were UTF-8, of course you'll get null characters in every second byte. The reason is that the character `0` gets encoded in UTF-8 as the single byte `0x30`, but in UTF-16 as the two bytes `0x30 0x00` (`0x0030`). – Vlad Feb 22 '10 at 21:09
  • @Matthew: Are you suggesting I can mix UTF-8 and UTF-16 encoded strings in the same XML file? @Vlad - it wasn't any specific software that died; it was actually the core C# *System.Xml.Linq.XElement.WriteTo(XmlWriter)* method which threw the exception (although the code has changed, so I can't reproduce the error w/o a lot of work). – NVRAM Feb 22 '10 at 21:14
  • @NVRAM: No you can't successfully mix any two encodings in the same file, except where one entirely overlaps another (UTF-8 and ASCII for instance). UTF-8 and UTF-16 are completely different, but you seem to think from your question that one should be able to decode the other successfully. – Matthew Scharley Feb 22 '10 at 21:17
  • That said, you can read both types of files *as text* and combine them in C#, and then when you output them, they will be reencoded to whatever format your output file is in. – Matthew Scharley Feb 22 '10 at 21:19
  • @Matthew - not sure what part of my OP sounds like I thought the decoders are interoperable, it's the whole point of my post! But your last comment is on track with my intent (plus a step of saving it to a DB), but I first must decode them properly. – NVRAM Feb 22 '10 at 21:42

5 Answers


There won't be an absolutely reliable way, but you may be able to get a "pretty good" result with some heuristics.

  • If the data starts with a BOM, use it.
  • If the data contains 0-bytes, it is likely utf-16 or utf-32. You can distinguish between these, and between their big-endian and little-endian variants, by looking at the positions of the 0-bytes.
  • If the data can be decoded as utf-8 (without errors), then it is very likely utf-8 (or US-ASCII, but this is a subset of utf-8)
  • Next, if you want to go international, map the browser's language setting to the most likely encoding for that language.
  • Finally, assume ISO-8859-1
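
A rough C# sketch of that detection order (illustrative only; `DetectBomEncoding` and `GuessUtf16Or32FromZeroPattern` are hypothetical helpers for the first two steps, and the browser-language step is omitted):

```csharp
using System;
using System.Text;

static Encoding GuessEncoding(byte[] data)
{
    Encoding fromBom = DetectBomEncoding(data);           // hypothetical BOM helper
    if (fromBom != null)
        return fromBom;

    if (Array.IndexOf(data, (byte)0) >= 0)                // 0-bytes: almost certainly UTF-16/UTF-32
        return GuessUtf16Or32FromZeroPattern(data);       // hypothetical helper: inspect 0-byte positions

    try
    {
        // Strict UTF-8 decoder: any invalid byte sequence throws.
        new UTF8Encoding(false, true).GetString(data);
        return Encoding.UTF8;                             // decodes cleanly: very likely UTF-8 (or ASCII)
    }
    catch (DecoderFallbackException)
    {
        // fall through to the 8-bit fallback
    }

    return Encoding.GetEncoding("ISO-8859-1");            // last resort (or map from browser language)
}
```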

Whether "pretty good" is "good enough" depends on your application, of course. If you need to be sure, you might want to display the results as a preview, and let the user confirm that the data looks right. If it doesn't, try the next likely encoding, until the user is satisfied.

Note: this algorithm will not work if the data contains garbage characters. For example, a single garbage byte in otherwise valid utf-8 will cause utf-8 decoding to fail, making the algorithm go down the wrong path. You may need to take additional measures to handle this. For example, if you can identify possible garbage beforehand, strip it before you try to determine the encoding. (It doesn't matter if you strip too aggressively; once you have determined the encoding, you can decode the original unstripped data, just configure the decoders to replace invalid characters instead of throwing an exception.) Or count decoding errors and weight them appropriately. But this probably depends very much on the nature of your garbage, i.e. what assumptions you can make.
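
In .NET, the switch between "throw" and "replace" is the decoder fallback; something like:

```csharp
using System.Text;

// Strict decoder: any invalid byte throws DecoderFallbackException (use while detecting).
Encoding strictUtf8 = Encoding.GetEncoding("utf-8",
    EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);

// Lenient decoder: invalid bytes become U+FFFD instead of throwing (use for the final decode).
Encoding lenientUtf8 = Encoding.GetEncoding("utf-8",
    EncoderFallback.ReplacementFallback, DecoderFallback.ReplacementFallback);
```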

oefe
  • This is helpful, although note that some UTF-16LE files I had were decoded *without exceptions* by the C#/.NET encoding framework; there were *errors* (null characters) but no *exceptions*. My intention is auto-detection (hence the posting), and I've partially implemented it since I already detect MSWord, PDF and other non-text files, but the issue is determining when an encoding is the *right* one. – NVRAM Feb 24 '10 at 22:46
  • You're right, the 0-byte check needs to go first; I fixed the order of steps in my answer accordingly. – oefe Feb 25 '10 at 20:32

Have you tried reading a representative cross-section of your users' files, running them through your program, testing, correcting any errors and moving on?

I've found File.ReadAllLines() pretty effective across a very wide range of applications, without my having to worry about all of the encodings. It seems to handle them pretty well.
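
For example (here `path` just stands in for wherever the uploaded file ends up):

```csharp
using System.IO;

// File.ReadAllLines/ReadAllText sniff a BOM (UTF-8, UTF-16, etc.);
// with no BOM they fall back to UTF-8, which also covers plain ASCII.
string[] lines = File.ReadAllLines(path);
string   text  = File.ReadAllText(path);
```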

XmlReader has done fairly well once I figured out how to use it properly.

Maybe you could post some specific examples of data and get some better responses.

No Refunds No Returns
  • Thanks, but I'm looking for a general purpose solution. In this application, the app is deployed at the customer's site and I don't have access (or legal permission) to the files. They are *any* text document the user wishes to upload. Some are PDF-to-text, some are scraped from web sites, some are from PPT slides, some are.... who knows. – NVRAM Feb 22 '10 at 21:18
  • Then I would say make sure you have extensive logging about input/output etc written to the users local event log. This sounds like a no-win situation to me. – No Refunds No Returns Feb 22 '10 at 21:44
  • Incidentally, I don't get what you mean by *correcting any errors and moving on* -- I cannot "correct" the user's files, and the error I now have is that they must correctly select the encoding format. I'll look into **File.ReadAllLines()**... – NVRAM Feb 22 '10 at 21:49
  • Is the encoding-detection capabilities of **File.ReadAllLines()** available for Streams? – NVRAM Feb 23 '10 at 00:51

This is a well-known problem. You can try to do what Internet Explorer does. There is a nice article on CodeProject that describes Microsoft's solution to the problem. However, no solution is 100% accurate, as everything is based on heuristics. It is also not safe to assume that a BOM will be present.

kgiannakakis

You may like to look at a Python-based solution called chardet. It's a Python port of Mozilla code. Although you may not be able to use it directly, its documentation is well worth reading, as is the original Mozilla article it references.
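
If you need to stay in C#/.NET, the same Mozilla detector has been ported (for example the UDE project mentioned in the comments). Assuming its `CharsetDetector` API (Feed/DataEnd/Charset/Confidence), usage is roughly:

```csharp
using System;
using System.IO;
using Ude; // C# port of Mozilla's universal charset detector

// 'path' is illustrative; feed the detector the raw bytes of the upload.
byte[] bytes = File.ReadAllBytes(path);
var detector = new CharsetDetector();
detector.Feed(bytes, 0, bytes.Length);
detector.DataEnd();

Console.WriteLine(detector.Charset != null
    ? string.Format("{0} (confidence {1})", detector.Charset, detector.Confidence)
    : "no encoding detected");
```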

John Machin
  • FWIW, I grabbed UDE [http://code.google.com/p/ude/] compiled it with Mono. I then ran the resulting EXE against files that were encoded ISO-8859-1, -2, UTF-{8,16LE,16BE,32LE,32BE} and it only recognized the UTF-8 correctly (guessed windows-1255 or -1252 for everything else). – NVRAM Mar 30 '10 at 23:40
  • It won't recognise UTF-nnxE without a BOM; did yours have a BOM? ISO-8859-n is a figment of the imagination -- decode it to Unicode and see if you have any characters in the range U+0080 to U+009F ;-) – John Machin Mar 31 '10 at 00:46

I ran into a similar issue. I needed a PowerShell script that figured out whether a file was text-encoded (in any common encoding) or not.

It's definitely not exhaustive, but here's my solution...

PowerShell search script that ignores binary files

kervin