
I have an old project that reads files using CP850 encoding, but it handles accented characters incorrectly (e.g., Montréal becomes MontrÚal). I want to replace CP850 with UTF-8. The question is:

Is it safe? In other words, can we assume UTF-8 is a superset of CP850 and encodes characters the same way?

Thanks

I tried hexdump; below is a sample of my CSV file. Is it UTF-8?

000000d0  76 20 64 65 20 4d 61 72  6c 6f 77 65 2c 2c 4d 6f  |v de Marlowe,,Mo|
000000e0  6e 74 72 c3 a9 61 6c 2c  51 43 2c 48 34 41 20 20  |ntr..al,QC,H4A  |
Eric
  • "I have a CSV file and without encoding information.": Then you have lost data. Without this essential metadata, a text file just contains bytes. Character encoding usage is an agreement between writer and readers. You can only change the encoding after changing the agreement. – Tom Blodget Jul 13 '18 at 01:46

1 Answer

If by superset you mean does UTF-8 include all the characters of CP850, then trivially yes, since UTF-8 can encode all valid Unicode code points using a variable-length encoding (1–4 bytes).

If you mean are characters encoded the same way, then as you've seen this is not the case, since é (U+00E9) is encoded as 82 in CP850 and C3 A9 in UTF-8.
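To make that concrete, here is a minimal Java sketch (assuming your JRE ships the IBM850/cp850 charset, which standard JDKs do) that prints the bytes each encoding produces for é:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EAcuteBytes {
    public static void main(String[] args) {
        String s = "\u00e9"; // é (U+00E9)

        // CP850 (charset name "IBM850" in Java) encodes é as the single byte 0x82
        System.out.println(toHex(s.getBytes(Charset.forName("IBM850"))));

        // UTF-8 encodes the same character as the two bytes 0xC3 0xA9
        System.out.println(toHex(s.getBytes(StandardCharsets.UTF_8)));
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x ", b));
        return sb.toString().trim();
    }
}

Running it prints "82" and then "c3 a9", matching the two encodings described above.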

I cannot see a character set / code page that encodes Ú as 82, but Ú is encoded as E9 in CP850, which is the ISO-8859-1 representation of é, so it's possible you've got your conversion the wrong way around (i.e. you're converting your file from ISO-8859-1 to CP850, and you want to convert from CP850 to UTF-8).

Here's an example using hd and iconv:

hd test.cp850.txt
00000000  4d 6f 6e 74 72 82 61 6c                           |Montr.al|
00000008

iconv --from cp850 --to utf8 test.cp850.txt > test.utf8.txt

hd test.utf8.txt
00000000  4d 6f 6e 74 72 c3 a9 61  6c                       |Montr..al|
00000009
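If you would rather do the conversion in code than with iconv, a rough Java equivalent looks like this (the file names are placeholders, and it again assumes the IBM850 charset is available):

import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Cp850ToUtf8 {
    public static void main(String[] args) throws IOException {
        Path in = Paths.get("test.cp850.txt");   // placeholder input path
        Path out = Paths.get("test.utf8.txt");   // placeholder output path

        // Decode the bytes as CP850, then write the same characters back out as UTF-8
        String text = new String(Files.readAllBytes(in), Charset.forName("IBM850"));
        Files.write(out, text.getBytes(StandardCharsets.UTF_8));
    }
}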
cmbuckley
  • That's exactly what I'm asking about, thanks. Do you have a solution for that? I have a CSV file without encoding information. How do we choose which encoding to use to read the file? – Eric Jul 12 '18 at 17:27
  • I have CSV files and was using CP850 encoding, but one of the CSV files doesn't read correctly. How do we choose another (superset) encoding to read that file without affecting the other files already read correctly using CP850? – Eric Jul 12 '18 at 17:36
  • If you are assuming it’s CP850, and you’re getting MontrÚal, I’d assume that file is in CP1252 or ISO-8859-1 instead. – cmbuckley Jul 12 '18 at 21:36
  • I know CP1252 encodes differently from CP850 even though it's a superset. How about ISO-8859-1? Can it be used to replace CP850 with no side effects? – Eric Jul 13 '18 at 13:51
  • Your question really isn’t clear; it depends on what language your text uses and what you mean by “replace.” What you _should_ be doing is working out the charset of each text (e.g. CP1252, CP850, etc.) and *converting* them all to a standard, known format such as UTF-8. It is difficult to reliably recover the charset of arbitrary text, but pretty much all encodings agree on the ASCII code points (e.g. the basic Latin alphabet), so you may need to do some checks where characters fall outside this range and choose the most appropriate charset. I find www.fileformat.info helpful for this. – cmbuckley Jul 15 '18 at 16:41
  • I have a CSV file which explicitly uses UTF-8 encoding, but when I read the file in Java code, Montréal becomes Montr?al. What might be the problem? BTW, the environment is Unix. – Eric Jul 31 '18 at 15:57
  • Are you sure it's in UTF-8 encoding? How do you know? Check the bytes as above, using `hd` or an equivalent tool to find the byte sequence. You can also use programmatic tools to guess the correct encoding: https://stackoverflow.com/questions/499010/java-how-to-determine-the-correct-charset-encoding-of-a-stream (see also the Java reading sketch after these comments). – cmbuckley Aug 01 '18 at 08:36
  • I updated my question and added a sample of my CSV file's byte sequence. Can you please verify it's UTF-8? – Eric Aug 01 '18 at 13:23
  • You can see the bytes `c3 a9` for the é, so it's definitely UTF-8 encoding. To answer your original question, UTF-8 is definitely a superset of CP850, because UTF-8 can represent all Unicode characters. But it is not a _binary superset_, because the same characters are represented by different bytes in each charset. – cmbuckley Aug 01 '18 at 14:22
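For the Java reading question in the comments, here is a minimal sketch (the file name is a placeholder) that reads the CSV with an explicit UTF-8 charset instead of relying on the platform default. Note that a ? can also appear when the decoded text is later written or printed using a charset that cannot represent é, so it is worth checking the output side as well.

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadUtf8Csv {
    public static void main(String[] args) throws IOException {
        // Open the file with an explicit charset so the platform default is never used
        try (BufferedReader reader = Files.newBufferedReader(
                Paths.get("data.csv"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}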