Which double quote characters are automatically replaced when converting from UTF-8 to ISO-8859-15?

Question

I have an input file that is UTF-8 encoded. I need to use some of its content and create an ISO-8859-15 encoded CSV file from it.

The problem is that UTF-8 seems to have several double quote characters that are automatically replaced with the character " (Quotation Mark, U+0022) when the CSV file is written to disk.

The ones we found are:

The conversion happens automatically when I write to the CSV file like this:

using (StreamWriter sw = new StreamWriter(workDir + "/files/vehicles.csv", append: false, encoding: Encoding.GetEncoding("ISO-8859-15")))
{
    foreach (ad vehicle in vehicles)
    {
        sw.WriteLine(convertVehicleToCsv(vehicle));
    }
}

The method convertVehicleToCsv escapes double quotes and other special characters in the data, but does not escape the special UTF-8 double quote characters. Because those are replaced automatically after escaping, the CSV ends up with unescaped double quotes, no longer conforms to RFC 4180 and is therefore corrupt. Reading it with our CSV library fails.

So the question is:

What other UTF-8 characters are automatically replaced/converted to the "normal" " character when converting to ISO-8859-15? Is this documented somewhere? Or am I doing something wrong here?

Out of interest, what would you *expect* to happen in that situation? I assume that ISO-8859-15 just doesn't include those characters. — Jon Skeet, Dec 02 '15 at 10:48
Well, I like it that they are replaced this way. But I need to know which characters are "automagically" handled like this. — Krisztián Balla, Dec 02 '15 at 10:49
It sounds like you should probably just convert the original content to ISO-8859-15 as early as possible, so that the conversion happens *before* escaping. Would that solve it without having to be exhaustive about the replacements? You could find a good chunk of the replacements naively by just converting a string with every Unicode character in... but I don't know whether the encoder might be very smart with multiple characters in some cases.. — Jon Skeet, Dec 02 '15 at 10:53
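
A minimal sketch of the convert-before-escape idea from the comment above, assuming the default substitution behaviour described in the question. `Latin9Csv`, `ToLatin9Text` and `EscapeCsvField` are hypothetical names, and `EscapeCsvField` only stands in for whatever escaping `convertVehicleToCsv` really does. The round trip through the encoding means that any characters the encoder would substitute have already been substituted by the time the field is escaped:

```csharp
using System;
using System.Text;

public static class Latin9Csv
{
    private static readonly Encoding Latin9 = CreateLatin9();

    private static Encoding CreateLatin9()
    {
        // Required on .NET Core / .NET 5+; on .NET Framework this line can be dropped.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("ISO-8859-15");
    }

    // Round-trips the text through ISO-8859-15 so that it only contains
    // characters the target encoding can represent.
    public static string ToLatin9Text(string value)
    {
        return Latin9.GetString(Latin9.GetBytes(value));
    }

    // Hypothetical RFC 4180 style escaping: convert first, then quote the
    // field and double any embedded double quotes.
    public static string EscapeCsvField(string value)
    {
        string converted = ToLatin9Text(value);
        return "\"" + converted.Replace("\"", "\"\"") + "\"";
    }
}
```

With this order of operations the writer has nothing left to substitute, so the list of replaced characters does not need to be known in advance.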

ardila · Accepted Answer · 2019-07-12T09:42:36.597

To answer your question, here's the list of Unicode code points which .NET is mapping to U+0022 (what you've referred to as "normal double quote" symbol) when using a StreamWriter as you've done:

U+0022
U+02BA
U+030E
U+201C
U+201D
U+201E
U+FF02

Using this answer, I quickly wrote something that builds a reverse mapping from each ISO-8859-15 (Latin-9) byte to the Unicode code points that are converted to it.

Encoding utf8 = Encoding.UTF8;
Encoding latin9 = Encoding.GetEncoding("ISO-8859-15");
// Note: code page 1252 is Windows-1252, not ISO-8859-1 (which is code page 28591).
Encoding iso = Encoding.GetEncoding(1252);

// Maps each Latin-9 byte (as hex) to the Unicode code points that are converted to it.
var map = new Dictionary<string, List<string>>();

// same code to get each line from the file as per the linked answer

while (true)
{
    string line = reader.ReadLine();
    if (line == null) break;
    string codePointHexAsString = line.Substring(0, line.IndexOf(";"));
    int codePoint = Convert.ToInt32(codePointHexAsString, 16);

    // skip Unicode surrogate area
    if (codePoint >= 0xD800 && codePoint <= 0xDFFF)
        continue;

    string utf16String = char.ConvertFromUtf32(codePoint);
    byte[] utf8Bytes = utf8.GetBytes(utf16String);
    byte[] latin9Bytes = Encoding.Convert(utf8, latin9, utf8Bytes);
    string latin9String = latin9.GetString(latin9Bytes);
    byte[] isoBytes = Encoding.Convert(utf8, iso, utf8Bytes);
    string isoString = iso.GetString(isoBytes); // this is not always the same as latin9String!

    // first byte of the Latin-9 result, as hex
    string latin9HexAsString = latin9Bytes[0].ToString("X");

    if (!map.ContainsKey(latin9HexAsString))
    {
        map[latin9HexAsString] = new List<string>();
    }
    map[latin9HexAsString].Add(codePointHexAsString);
}

Interestingly, ISO-8859-15 seems to be replacing more characters than ISO-8859-1, which I didn't expect.

It would make sense to use ISO-8859-1 as a fallback for ISO-8859-15, since ISO-8859-15 is the same but the international currency symbol is replaced by the Euro (€) symbol. I'm waiting for your updated answer. What is codePoint in your code? — Krisztián Balla, Dec 02 '15 at 17:36
Your list of characters doesn't include mine, right? I'll update my question with your entries. — Krisztián Balla, Dec 02 '15 at 17:42
@JennyO'Reilly that's not entirely correct. There's a few more differences. See https://en.wikipedia.org/wiki/ISO/IEC_8859-15#Differences_from_ISO-8859-1. In the linked question `codePoint` is an `int` representing the Unicode character code point. PS: my figuring out where in the innards of the framework this conversion is happening won't change the list of characters already given as answer to your question :) — ardila, Dec 02 '15 at 18:13
Looks like our list is complete. I added `if (latin9String == "\"") { System.Console.WriteLine(codePoint.ToString("X")); }` to your loop and it printed the codes from my question / your answer. — Krisztián Balla, Dec 03 '15 at 09:01
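
The brute-force check described in the comments can also be written without the Unicode data file, by walking every code point. This is only a sketch: `QuoteScan` and `FindCodePointsMappedTo` are made-up names, and the result depends on the substitution tables of the runtime it runs on, so the output may differ between .NET versions.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class QuoteScan
{
    // Returns every Unicode code point whose encoded form in the given
    // encoding is exactly the single byte 'target'.
    public static List<int> FindCodePointsMappedTo(byte target, Encoding encoding)
    {
        var hits = new List<int>();
        for (int codePoint = 0; codePoint <= 0x10FFFF; codePoint++)
        {
            // Surrogates are not valid code points on their own.
            if (codePoint >= 0xD800 && codePoint <= 0xDFFF)
                continue;

            byte[] bytes = encoding.GetBytes(char.ConvertFromUtf32(codePoint));
            if (bytes.Length == 1 && bytes[0] == target)
                hits.Add(codePoint);
        }
        return hits;
    }

    // Convenience wrapper for the case in the question: what ends up as 0x22 (")?
    public static List<int> FindLatin9DoubleQuoteSources()
    {
        // Required on .NET Core / .NET 5+; on .NET Framework this line can be dropped.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return FindCodePointsMappedTo(0x22, Encoding.GetEncoding("ISO-8859-15"));
    }
}
```

Printing each entry of `QuoteScan.FindLatin9DoubleQuoteSources()` with `ToString("X4")` gives a list that can be compared against the one in the answer.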

score 1 · Answer 2 · answered Dec 02 '15 at 21:29

The .NET Framework uses best-fit mapping by default when converting from Unicode to legacy character encodings such as ISO-8859-15. This is documented in the Windows Protocols Unicode Reference on MSDN. That document refers to a download called "Sorting Weight Tables" from the Microsoft Download Center, which includes the best-fit mappings for the legacy encodings supported by Windows (in the file "Windows Supported Code Page Data Files.zip", at the time of this writing).
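
If you would rather have the write fail than have characters silently substituted, the encoder fallback can be overridden when the encoding is created. A sketch, assuming that passing an explicit `EncoderFallback` takes the place of the default best-fit behaviour for this code page; `StrictLatin9` and `CanEncode` are hypothetical names:

```csharp
using System;
using System.Text;

public static class StrictLatin9
{
    private static readonly Encoding Strict = Create();

    // Returns an ISO-8859-15 encoding that throws on characters it cannot
    // represent instead of substituting a similar-looking one.
    public static Encoding Create()
    {
        // Required on .NET Core / .NET 5+; on .NET Framework this line can be dropped.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(
            "ISO-8859-15",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
    }

    // True if every character of 'text' exists in ISO-8859-15.
    public static bool CanEncode(string text)
    {
        try
        {
            Strict.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }
}
```

Passing the encoding from `StrictLatin9.Create()` to the `StreamWriter` should surface an `EncoderFallbackException` for text containing such characters, which makes problematic input visible instead of producing a corrupt CSV.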
