14

I am working on a codebase that has some Unicode-encoded files scattered throughout, the result of multiple team members developing with different editors (and default settings). I would like to clean up our code base by finding all the Unicode-encoded files and converting them back to ANSI encoding.

Any thoughts on how to accomplish the "finding" part of this task would be truly appreciated.

HOCA
  • What programming language are you using? I suppose a small VBS script can suffice for this task. – LostInTheCode Jan 12 '11 at 18:49
  • We're using C#, but I was looking more for a tool that I could use to search for Unicode-encoded files. What would you look for in the text files to identify them as Unicode? – HOCA Jan 12 '11 at 19:04

5 Answers

6

See “How to detect the character encoding of a text-file?” or “How to reliably guess the encoding [...]?”

  • UTF-8 can be detected with validation. You can also look for the BOM EF BB BF, but don't rely on it.
  • UTF-16 can be detected by looking for the BOM.
  • UTF-32 can be detected by validation, or by the BOM.
  • Otherwise assume the ANSI code page.
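Putting those rules together: a minimal C# sketch of the detection order described above (the asker mentions C# in the comments). The method name and return strings are mine, and this is an illustration rather than a complete detector:

    using System;
    using System.IO;
    using System.Text;

    static class EncodingSniffer
    {
        // Guess a file's encoding: BOM first (longest BOMs first, since the
        // UTF-32 LE BOM starts with the same bytes as the UTF-16 LE BOM),
        // then strict UTF-8 validation, else fall back to the ANSI code page.
        public static string GuessEncoding(string path)
        {
            byte[] b = File.ReadAllBytes(path);

            if (b.Length >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)
                return "UTF-32 LE (BOM)";
            if (b.Length >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF)
                return "UTF-32 BE (BOM)";
            if (b.Length >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
                return "UTF-8 (BOM)";
            if (b.Length >= 2 && b[0] == 0xFF && b[1] == 0xFE)
                return "UTF-16 LE (BOM)";
            if (b.Length >= 2 && b[0] == 0xFE && b[1] == 0xFF)
                return "UTF-16 BE (BOM)";

            try
            {
                // Strict decoder: throws if the data is not valid UTF-8.
                new UTF8Encoding(false, true).GetString(b);
                return "UTF-8 without BOM (or plain ASCII)";
            }
            catch (DecoderFallbackException)
            {
                return "Assume the ANSI code page";
            }
        }
    }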

  • Our codebase doesn't include any non-ASCII chars. I will try to grep for the BOM in files in our codebase. Thanks for the clarification. – HOCA
  • Well that makes things a lot simpler. UTF-8 without non-ASCII chars is ASCII. – dan04

dan04
  • What do you mean by "can be detected with validation" - what kind of validation are you referring to? Thanks! – LearnByReading Nov 25 '15 at 15:00
  • I mean checking that the data consists solely of valid UTF-8 byte sequences. For example, `F0 9F 92 A9` is valid UTF-8, but `F5 9F 92 A9` is not. – dan04 Nov 26 '15 at 00:59
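That strict check is easy to reproduce in C#, shown here with the exact byte sequences from the comment above (the class name is mine):

    using System;
    using System.Text;

    class Utf8ValidationDemo
    {
        static void Main()
        {
            // Strict decoder: throws DecoderFallbackException on invalid UTF-8.
            var strictUtf8 = new UTF8Encoding(false, true);

            byte[] valid   = { 0xF0, 0x9F, 0x92, 0xA9 }; // decodes to U+1F4A9
            byte[] invalid = { 0xF5, 0x9F, 0x92, 0xA9 }; // 0xF5 is never a valid UTF-8 lead byte

            Console.WriteLine(strictUtf8.GetString(valid)); // decodes fine

            try { strictUtf8.GetString(invalid); }
            catch (DecoderFallbackException) { Console.WriteLine("not valid UTF-8"); }
        }
    }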
5

Unicode is a standard, not an encoding. There are many encodings that implement Unicode, including UTF-8, UTF-16, UCS-2, and others. The translation of any of these encodings to ASCII depends entirely on which encoding your "different editors" use.

Some editors insert byte-order marks, or BOMs, at the start of Unicode files. If your editors do that, you can use the BOMs to detect the encoding.

ANSI is a standards body that has published several encodings for digital character data. The "ANSI" encoding used by Windows is actually CP-1252, which is not an ANSI standard (MS-DOS used the OEM code pages instead).

Does your codebase include non-ASCII characters? You may have better compatibility using a Unicode encoding rather than an ANSI one or CP-1252.
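As a quick way to answer that question across a codebase, here is a minimal C# sketch that reports whether a file is pure ASCII (the helper name is illustrative):

    using System.IO;
    using System.Linq;

    static class AsciiCheck
    {
        // True if every byte is 0x00-0x7F; such a file reads identically
        // as ASCII, as CP-1252, and as UTF-8 without a BOM.
        public static bool IsPureAscii(string path) =>
            File.ReadAllBytes(path).All(b => b <= 0x7F);
    }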

Dour High Arch
  • Our codebase doesn't include any non-ASCII chars. I will try to grep for the BOM in files in our codebase. Thanks for the clarification. – HOCA Jan 12 '11 at 20:44
  • There is no single Windows 8-bit (aka ANSI) encoding; there are many, such as CP1251, CP1252, CP1253, and so on. Also see this question: http://stackoverflow.com/questions/3864240/default-code-page-for-each-language-version-of-windows – dalle Jan 12 '11 at 22:12
  • @HOCA, if your files contain only ASCII, they are already valid UTF-8 and do not need "converting". – Dour High Arch Jan 13 '11 at 18:00
2

Actually, if you want to find out in Windows whether a file is Unicode (in the Windows sense, UTF-16), simply run findstr on the file for a string you know is in there:

findstr /I /C:"SomeKnownString" file.txt

If the file is UTF-16, the search will come back empty even though the string is present. Then, to be sure, run findstr on a single letter or digit you know is in the file:

FindStr /I /C:"P" file.txt

You will probably get many occurrences, and the key is that they will be spaced apart (in a mostly-ASCII UTF-16 file, each character is separated from the next by a null byte). This is a sign the file is UTF-16 and not ASCII.
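The same heuristic can be automated rather than eyeballed; a rough C# sketch of the null-byte test, where the 30% cutoff is an arbitrary threshold chosen for illustration:

    using System.IO;

    static class Utf16Heuristic
    {
        // A UTF-16 file holding mostly ASCII text is roughly half null bytes,
        // which is why the findstr matches above appear "spaced apart".
        public static bool LooksLikeUtf16(string path)
        {
            byte[] bytes = File.ReadAllBytes(path);
            if (bytes.Length == 0) return false;

            int nulls = 0;
            foreach (byte b in bytes)
                if (b == 0x00) nulls++;

            return (double)nulls / bytes.Length > 0.3; // illustrative cutoff
        }
    }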

Hope this helps.

John
1

If you're looking for a programmatic solution, IsTextUnicode() might be an option.
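For example, a minimal C# P/Invoke sketch (IsTextUnicode is exported from Advapi32.dll; the mask below asks only for the little-endian UTF-16 tests, one plausible choice among the IS_TEXT_UNICODE_* flags):

    using System;
    using System.IO;
    using System.Runtime.InteropServices;

    static class UnicodeProbe
    {
        // On input, 'tests' selects which IS_TEXT_UNICODE_* checks to run;
        // on output it holds the subset that passed.
        [DllImport("Advapi32.dll")]
        static extern bool IsTextUnicode(byte[] buffer, int size, ref int tests);

        const int IS_TEXT_UNICODE_UNICODE_MASK = 0x000F;

        static void Main(string[] args)
        {
            byte[] data = File.ReadAllBytes(args[0]);
            int tests = IS_TEXT_UNICODE_UNICODE_MASK;
            bool result = IsTextUnicode(data, data.Length, ref tests);
            Console.WriteLine(result ? "Looks like UTF-16" : "Does not look like UTF-16");
        }
    }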

Luke
  • That API function is problematic: http://blogs.msdn.com/b/michkap/archive/2005/01/30/363308.aspx – Nemanja Trifunovic Jan 12 '11 at 22:33
  • It doesn't even support UTF-8. – dan04 Jan 13 '11 at 01:00
  • While it's not perfect, IsTextUnicode is exactly what Notepad uses to differentiate between Unicode and ANSI/UTF-8. It looks for the BOM header in a file. Failing that, it's got some statistical inference algorithm. But you're on your own for detecting between ANSI and UTF-8. – selbie Jan 13 '11 at 07:33
0

It's kind of hard to say, but I'd start by looking for a BOM. Most Windows programs that write Unicode files emit BOMs.

If these files exist in your codebase, presumably they compile. You might ask yourself whether you really need to do this "tidying up". If you do need to do it, then I would ask how the tool chain that processes these files discovers their encoding. If you know that, then you'll be able to use the same diagnostic.
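If the files do carry BOMs, the "finding" step reduces to a short scan; a minimal C# sketch (the extension filter and root path are illustrative):

    using System;
    using System.IO;

    static class BomScan
    {
        static void Main()
        {
            foreach (string path in Directory.EnumerateFiles(".", "*.cs", SearchOption.AllDirectories))
            {
                var head = new byte[3]; // stays zeroed for files shorter than 3 bytes
                using (var fs = File.OpenRead(path))
                    fs.Read(head, 0, 3);

                bool utf8Bom  = head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF;
                bool utf16Bom = (head[0] == 0xFF && head[1] == 0xFE) || (head[0] == 0xFE && head[1] == 0xFF);

                if (utf8Bom || utf16Bom)
                    Console.WriteLine(path + ": " + (utf8Bom ? "UTF-8 BOM" : "UTF-16 BOM"));
            }
        }
    }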

David Heffernan
  • We're seeing the Google Closure Compiler ignore JS files that are encoded in UTF-8, which is the reason for this "tidying up". I suppose grepping for the BOM is probably the cheapest solution here. – HOCA Jan 12 '11 at 20:42
  • @HOCA How would the Google Closure Compiler know to ignore a file unless it had a BOM? I'd bet that these files have BOMs and so grep will do the job. – David Heffernan Jan 12 '11 at 20:45
  • @HOCA Well, grep will find them but you may want to use a Perl/Python/Ruby/whatever script to actually convert them, if there are a lot. – David Heffernan Jan 12 '11 at 20:46
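The conversion step can be equally small in C#; a minimal sketch, assuming the files have BOMs (so StreamReader's default detection works) and that Encoding.Default, which is the system ANSI code page on .NET Framework, is the desired target:

    using System.IO;
    using System.Text;

    static class ToAnsi
    {
        // Read using BOM auto-detection, then rewrite in the ANSI code page
        // (no BOM is emitted). Characters outside the code page would be
        // replaced with '?', which is safe here since the files are ASCII-only.
        public static void ConvertFile(string path)
        {
            string text;
            using (var reader = new StreamReader(path, detectEncodingFromByteOrderMarks: true))
                text = reader.ReadToEnd();

            File.WriteAllText(path, text, Encoding.Default);
        }
    }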