
Is Windows-1252 a subset of UTF-8 or not, and what are the differences?

I'm thinking of migrating my DB from Windows-1252 to UTF-8. Any thoughts or opinions?

samg

3 Answers


Windows-1252 is a subset of UTF-8 in terms of which characters are available, but not in terms of their byte-by-byte representation. The characters that Windows-1252 stores in bytes 128 to 255 are encoded differently in UTF-8, as multi-byte sequences.

Every character in the ASCII range (127 and below) is encoded identically, as the same single byte, in both.

So while you can convert between the two, a CP-1252 string is not guaranteed to be a valid UTF-8 string.
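
A minimal Python sketch of that last point, assuming the standard library's `cp1252` and `utf-8` codecs. The sample string and the helper name `is_valid_utf8` are only illustrative:

```python
# "é" is one byte in CP-1252 but two bytes in UTF-8.
text = "Café"
cp1252_bytes = text.encode("cp1252")   # b'Caf\xe9'
utf8_bytes = text.encode("utf-8")      # b'Caf\xc3\xa9'

# Pure ASCII text is byte-identical in both encodings.
assert "Cafe".encode("cp1252") == "Cafe".encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(cp1252_bytes))  # False: 0xE9 starts a sequence that never completes
print(is_valid_utf8(utf8_bytes))    # True

# Converting is a decode with the old encoding followed by an encode with the new one.
converted = cp1252_bytes.decode("cp1252").encode("utf-8")
assert converted == utf8_bytes
```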

Evert
  • Ok, so I'm planning to migrate my DB from the Windows-1252 character set to UTF-8 by doing the following: exporting my DB (backup), truncating all tables, running `alter database character set`, and finally importing the DB back... but in this case, how do I detect beforehand whether any characters will be lost or will need adjustments? – samg Aug 13 '19 at 19:46
  • @samg Hard to say without knowing which RDBMS you're using. It's also off-topic for this question, so perhaps you can just open a new question. – Evert Aug 13 '19 at 20:02
  • @samg: you could just create new fields (columns) with the new charset, so that you can compare both fields. You may also want to create a test database to check the behaviour. – Giacomo Catenazzi Aug 14 '19 at 11:48
  • Keep in mind that the Unicode code points U+0080..U+009F are C1 control characters, not printable ones. Windows-1252 does define printable characters at bytes hex 80..9F, and those are assigned elsewhere in Unicode (for example, byte 80 is €, which is U+20AC). So you have to be careful when writing a handler for that. – Gunnar Vestergaard Dec 02 '19 at 13:43

ANSI (Windows-1252) vs. UTF-8 in Emacs hexl-mode. "Cr" is 43 72 in both, but then there's an e with an accent, é: in ANSI it's e9, but in UTF-8 it's c3 a9. Then the a is 61. The UTF-8 file also has its BOM (encoding signature) at the beginning, ef bb bf.

         43 72    e9 61      Cr.a

ef bb bf 43 72 c3 a9 61  ...Cr..a 
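
The same bytes can be reproduced without Emacs. This is a small Python sketch, assuming Python 3.8 or later (for the separator argument of `bytes.hex`) and the standard `utf-8-sig` codec, which writes the BOM on encoding and strips it on decoding:

```python
text = "Créa"

ansi_hex = text.encode("cp1252").hex(" ")
utf8_hex = text.encode("utf-8").hex(" ")
utf8_bom_hex = text.encode("utf-8-sig").hex(" ")

print(ansi_hex)      # 43 72 e9 61
print(utf8_hex)      # 43 72 c3 a9 61
print(utf8_bom_hex)  # ef bb bf 43 72 c3 a9 61

# Decoding with "utf-8-sig" removes the BOM again.
assert bytes.fromhex(utf8_bom_hex).decode("utf-8-sig") == text
```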
js2010
  • Really, UTF-8 should never have a BOM (though it is OK if you converted the file from UTF-16 and will convert it back to UTF-16). Windows uses it to prepare the conversion, but it is just a hack. Additionally, there are two canonically equivalent ways to encode an e with an accent in Unicode: the precomposed U+00E9, or e followed by the combining accent U+0301. – Giacomo Catenazzi Aug 14 '19 at 11:30

Yes, the Windows-1252 characters are a subset of Unicode.

Unicode, by design, supports lossless conversion to and from most of the common character encodings that existed in 1993. CP-1252 is older than Unicode and was widely used, so Unicode was designed to include all of CP-1252.

This design fits your case: you can convert one layer at a time without losing information, so there is no need for a flag day. First convert only the database, and set the client [driver] to translate back to CP-1252. (This is usually the default: the client knows which encoding you expect and which one the database delivers, so it does the transcoding.) In a second step you can change the client side (and maybe later the front end).
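
The lossless round trip can be checked on sample data before migrating. The following Python sketch only simulates the two layers with the standard codecs; the helper name `roundtrip` and the sample text are illustrative, and a real database driver does its own transcoding:

```python
def roundtrip(data: bytes) -> bytes:
    """CP-1252 bytes -> UTF-8 bytes (new storage) -> CP-1252 bytes (old client)."""
    stored_as_utf8 = data.decode("cp1252").encode("utf-8")
    return stored_as_utf8.decode("utf-8").encode("cp1252")


# Sample text using several characters from the upper half of CP-1252.
sample = "Señor Müller paid 5 € – “thanks”".encode("cp1252")

# Nothing is lost on the way to UTF-8 and back.
assert roundtrip(sample) == sample
```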

You should just watch out for one problem: Unicode has several normalization forms, and often more than one possible representation of the same character. Converting from CP-1252 this is not a problem, but on the way back you may run into trouble, depending on the library you use. If you need to convert back, do some experiments first.
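
One such experiment, sketched in Python with the standard `unicodedata` module. The helper name `to_cp1252` is hypothetical, and normalizing to NFC before encoding is one possible approach, not the only one:

```python
import unicodedata

composed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"   # e followed by a combining acute accent

# Both render as é, but they are different code point sequences.
assert composed != decomposed
assert unicodedata.normalize("NFC", decomposed) == composed

# The composed form converts back to CP-1252 directly.
assert composed.encode("cp1252") == b"\xe9"

# The decomposed form does not: U+0301 has no CP-1252 byte.
try:
    decomposed.encode("cp1252")
    failed = False
except UnicodeEncodeError:
    failed = True
print(failed)  # True


def to_cp1252(text: str) -> bytes:
    """Normalize to NFC first so precomposed characters are used where they exist."""
    return unicodedata.normalize("NFC", text).encode("cp1252")


assert to_cp1252(decomposed) == b"\xe9"
```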

Many code values are the same in Unicode and in CP-1252, but UTF-8 requires two (or more) bytes for code points above 127, so the two are not byte-for-byte compatible. Usually a simple lookup table (256 elements) is enough to convert.

Non-printable characters are in theory the same, but every system may interpret them differently (e.g. new line, form feed [new page, or now often new section], or escape sequences (starting with ^[)). But this is not really relevant to you.
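
A sketch of such a 256-entry table in Python. The names `build_table`, `TABLE`, and `cp1252_to_utf8` are illustrative. Python's `cp1252` codec leaves five bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) unassigned; mapping them to U+FFFD here is an assumption, and other converters handle them differently:

```python
def build_table() -> list:
    """Map every CP-1252 byte value to its UTF-8 byte sequence."""
    table = []
    for value in range(256):
        try:
            char = bytes([value]).decode("cp1252")
        except UnicodeDecodeError:
            # Unassigned in Python's cp1252 codec; use the replacement character.
            char = "\ufffd"
        table.append(char.encode("utf-8"))
    return table


TABLE = build_table()


def cp1252_to_utf8(data: bytes) -> bytes:
    """Convert CP-1252 bytes to UTF-8 with one table lookup per input byte."""
    return b"".join(TABLE[value] for value in data)


print(TABLE[0x41])  # b'A'             (ASCII: unchanged)
print(TABLE[0xE9])  # b'\xc3\xa9'      (é: two bytes)
print(TABLE[0x80])  # b'\xe2\x82\xac'  (€: three bytes)
```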

Giacomo Catenazzi
  • Keep in mind that the Unicode code points U+0080..U+009F are C1 control characters, not printable ones. Windows-1252 does define printable characters at bytes hex 80..9F, and those are assigned elsewhere in Unicode (for example, byte 80 is €, which is U+20AC). So you have to be careful when writing a handler for that. – Gunnar Vestergaard Dec 02 '19 at 13:44