How to detect Windows-1251 encoded characters

Question

Is there a proper way to detect Windows-1251 encoded characters?

IMO, unlike multi-byte encodings, Windows-1251 is an 8-bit character encoding, so it's impossible to distinguish it from other 8-bit encodings like Latin-1. If I am wrong about this, please correct me.

My first clue is the locale: if the locale is ru, I treat all non-ASCII characters as Windows-1251.

Are there any better ways?

UPDATE:

Here is the context of my question: the ID3 info of some MP3 files contains Windows-1251 encoded characters. I have to detect those characters and then convert them to UTF-16 using icu4c; otherwise they are displayed as unreadable text on my system (Android). Maybe some of you know a better way.

Some MP3 files have Cyrillic characters in their ID3 tags, which are encoded in Windows-1251. – Alan, Jul 09 '13 at 10:29
So you want to be able to take an MP3 file and discern whether or not the ID3 tags are encoded in 1251? – David Heffernan, Jul 09 '13 at 10:38
I want to tell whether the ID3 tags are encoded in 1251, so that I can convert them to UTF-16 properly using icu4c, because some of the 1251 encoded characters are displayed as unreadable text on my system (Android). Is that clearer? – Alan, Jul 09 '13 at 11:08

score 1 · Answer 1 · answered Jul 09 '13 at 09:34

The GetACP function can be used to determine this. It returns the identifier of the ANSI code page that is currently active for the system.

The documented list of code page identifiers can be found here. The one you're looking for is 1251, which corresponds to the "ANSI Cyrillic (Windows)" code page.

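It is very simple to use from code, e.g. in C:

#include <Windows.h>

int main(void)
{
    /* GetACP() reports the system's active ANSI code page,
       not the encoding of any particular file or string. */
    if (GetACP() == 1251)  /* 1251 = ANSI Cyrillic (Windows) */
    {
        MessageBoxW(NULL,
                    L"Your system uses the ANSI Cyrillic code page.",
                    L"Code Page Detection",
                    MB_OK | MB_ICONINFORMATION);
    }
    return 0;
}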

Thanks, but maybe you misunderstood my question because of my bad English. What I actually need is to examine the characters themselves and tell whether they are Windows-1251 (Cyrillic) encoded; if so, I'll convert them to UTF-16 using icu4c. Is that clearer? – Alan, Jul 09 '13 at 10:46

score 0 · Accepted Answer · answered Jul 09 '13 at 11:14

Given an array of 8-bit characters as input, there is no reliable way to detect which 8-bit encoding was used for those characters.
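
To illustrate why, here is a minimal sketch in C (the language of the other answer). The function names `latin1_to_utf16` and `cp1251_to_utf16` are made up for this example, and the Windows-1251 decoder is deliberately partial: it covers only ASCII, the letter range 0xC0..0xFF and the two bytes for Ё/ё. Real code would use a full table or a library such as icu4c. The point is only that the same byte is a perfectly valid character in both encodings, so the bytes alone do not say which one was meant.

```c
#include <stdint.h>

/* ISO-8859-1 (Latin-1): every byte maps to the code point of the same value. */
uint16_t latin1_to_utf16(uint8_t b)
{
    return b;
}

/* Partial Windows-1251 decoder, for illustration only.
   Covers ASCII, the letters 0xC0..0xFF (U+0410..U+044F), and the
   bytes 0xA8 / 0xB8 (U+0401 / U+0451). Every other byte in
   0x80..0xBF is reported as U+FFFD because its mapping is left out
   of this sketch, not because it is undefined in Windows-1251. */
uint16_t cp1251_to_utf16(uint8_t b)
{
    if (b < 0x80) return b;
    if (b >= 0xC0) return (uint16_t)(0x0410 + (b - 0xC0));
    if (b == 0xA8) return 0x0401;
    if (b == 0xB8) return 0x0451;
    return 0xFFFD;
}
```

For example, the bytes CF F0 E8 E2 E5 F2 decode to "Привет" as Windows-1251 and to "Ïðèâåò" as Latin-1. Both results are valid text as far as the decoder is concerned; only a human reader (or a language model of some kind) can tell that one of them is nonsense.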

David Heffernan

So using the locale is actually a proper way? – Alan Jul 09 '13 at 11:25
No. My machine does not use 1251 and your files will still contain 1251 encoded tags when you move the file to my machine. – David Heffernan Jul 09 '13 at 11:26
I mean, if the locale is ru (Russian), then I treat the single-byte non-ASCII characters as Windows-1251. Our target market is Russia, which is where I ran into the unreadable-character issue in some Russian MP3 files. – Alan Jul 09 '13 at 12:25
That's up to you. Of course, you may well encounter lots of mp3 files with UTF-8 tags. – David Heffernan Jul 09 '13 at 15:01
@Alan You cannot reliably detect the code page of a file; you need to be told it along with the file. The only way to do it would be to build a really complex set of heuristics, which would be a lot of work to research, test, and otherwise make work correctly. Big problems come, for example, in distinguishing between ISO-8859-1 and Windows-1252, where the differences are extremely minor. More information in this question: [How can I detect the encoding/codepage of a text file](http://stackoverflow.com/q/90838). – Cody Gray - on strike Jul 10 '13 at 01:43
I recommend asking the user to identify their file if it doesn't contain identifying information already. That's how almost all text editors do it, except the ones that guess wrong (like Notepad). – Cody Gray - on strike Jul 10 '13 at 01:44
@CodyGray You're right, I can't agree more. – Alan Jul 10 '13 at 01:53
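
For anyone who still wants to guess, here is a rough sketch in C of the simplest kind of heuristic hinted at in the comments above. It is not a reliable detector. All names (`is_valid_utf8`, `guess_tag_encoding`, `enc_guess`) are invented for this example, and the "more Cyrillic-range letters than ASCII letters" rule is an arbitrary assumption that would need tuning on real tags. The idea is to rule out pure ASCII and well-formed UTF-8 first, and only then ask whether the remaining high bytes look like Windows-1251 letters.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { ENC_ASCII, ENC_UTF8, ENC_MAYBE_CP1251, ENC_UNKNOWN } enc_guess;

/* Returns 1 if buf[0..len) is well-formed UTF-8, otherwise 0. */
int is_valid_utf8(const uint8_t *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        uint8_t b = buf[i];
        size_t need;        /* number of continuation bytes expected */
        uint32_t cp, min;   /* decoded code point and smallest legal value */
        if (b < 0x80) { i++; continue; }
        else if ((b & 0xE0) == 0xC0) { need = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { need = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { need = 3; cp = b & 0x07; min = 0x10000; }
        else return 0;                        /* stray continuation or invalid lead */
        if (len - i - 1 < need) return 0;     /* sequence is truncated */
        for (size_t k = 1; k <= need; k++) {
            if ((buf[i + k] & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }
        /* Reject overlong forms, surrogates, and values past U+10FFFF. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
        i += need + 1;
    }
    return 1;
}

/* Guess the encoding of a tag. ENC_MAYBE_CP1251 is only a hint:
   the same bytes are also valid Latin-1, Windows-1252, KOI8-R, etc. */
enc_guess guess_tag_encoding(const uint8_t *buf, size_t len)
{
    size_t high = 0, cyr = 0, ascii_letters = 0;
    for (size_t i = 0; i < len; i++) {
        uint8_t b = buf[i];
        if (b >= 0x80) {
            high++;
            /* Windows-1251 letters: 0xC0..0xFF plus 0xA8 and 0xB8. */
            if (b >= 0xC0 || b == 0xA8 || b == 0xB8) cyr++;
        } else if ((b | 0x20) >= 'a' && (b | 0x20) <= 'z') {
            ascii_letters++;
        }
    }
    if (high == 0) return ENC_ASCII;
    if (is_valid_utf8(buf, len)) return ENC_UTF8;
    /* Assumption: Russian text is mostly Cyrillic letters, while
       Western European text has only a few accented ones. */
    if (cyr > ascii_letters) return ENC_MAYBE_CP1251;
    return ENC_UNKNOWN;
}
```

With this sketch, the Windows-1251 bytes CF F0 E8 E2 E5 F2 ("Привет") come back as `ENC_MAYBE_CP1251`, the same word in UTF-8 comes back as `ENC_UTF8`, and the Latin-1 bytes for "Café" come back as `ENC_UNKNOWN`. A Latin-1 tag made up mostly of accented letters would still be misreported, which is why combining a guess like this with the locale, or asking the user, remains the safer route.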
