I have an input file that is UTF-8 encoded. I need to use some of its content and create an ISO-8859-15 encoded CSV file from it.

The problem is that Unicode has several double-quote-like characters, and these are automatically replaced with the character " (= Quotation Mark U+0022) when the CSV file is written to disk.

The ones we found are:

The conversion happens automatically when I write to the CSV file like this:

using (StreamWriter sw = new StreamWriter(workDir + "/files/vehicles.csv", append: false, encoding: Encoding.GetEncoding("ISO-8859-15")))
{
    foreach (ad vehicle in vehicles)
    {
        sw.WriteLine(convertVehicleToCsv(vehicle));
    }
}

The method convertVehicleToCsv escapes double quotes and other special characters in the data, but does not escape the special Unicode double quote characters. Because those characters are replaced with a plain " only after escaping, the output contains unescaped double quotes, so the CSV no longer conforms to RFC 4180 and is corrupt. Reading it with our CSV library fails.

So the question is:

Which other Unicode characters are automatically replaced/converted to the "normal" " character when converting to ISO-8859-15? Is this documented somewhere? Or am I doing something wrong here?

Krisztián Balla
  • Out of interest, what would you *expect* to happen in that situation? I assume that ISO-8859-15 just doesn't include those characters. – Jon Skeet Dec 02 '15 at 10:48
  • Well, I like it that they are replaced this way. But I need to know which characters are "automagically" handled like this. – Krisztián Balla Dec 02 '15 at 10:49
  • It sounds like you should probably just convert the original content to ISO-8859-15 as early as possible, so that the conversion happens *before* escaping. Would that solve it without having to be exhaustive about the replacements? You could find a good chunk of the replacements naively by just converting a string with every Unicode character in... but I don't know whether the encoder might be very smart with multiple characters in some cases. – Jon Skeet Dec 02 '15 at 10:53

2 Answers

To answer your question, here's the list of Unicode code points which .NET maps to U+0022 (what you've referred to as the "normal" double quote) when using a StreamWriter as you've done:

  • U+0022
  • U+02BA
  • U+030E
  • U+201C
  • U+201D
  • U+201E
  • U+FF02
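Since each of these code points ends up as U+0022 in the output, one option, following the suggestion in the comments on the question, is to apply the conversion before escaping rather than after. The following is only a minimal sketch of that idea: the class QuoteNormalizer and both of its methods are hypothetical names, EscapeCsvField is a stand-in for the quoting part of convertVehicleToCsv (whose real implementation is not shown in the question), and the RegisterProvider call is assumed to be needed only on .NET Core / .NET 5+.

```csharp
using System;
using System.Text;

static class QuoteNormalizer
{
    static readonly Encoding Latin9 = CreateLatin9();

    static Encoding CreateLatin9()
    {
        // Assumption: required on .NET Core / .NET 5+, where legacy code pages
        // are not available until the provider has been registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("ISO-8859-15");
    }

    // Round-trips the text through ISO-8859-15 so that whatever replacements
    // the encoder performs are already present in the string before escaping.
    public static string ToLatin9Text(string value)
    {
        return Latin9.GetString(Latin9.GetBytes(value));
    }

    // Hypothetical RFC 4180 style quoting: convert first, then double every
    // remaining quotation mark and wrap the field in quotes.
    public static string EscapeCsvField(string value)
    {
        string text = ToLatin9Text(value);
        return "\"" + text.Replace("\"", "\"\"") + "\"";
    }

    static void Main()
    {
        // \u201D is RIGHT DOUBLE QUOTATION MARK, one of the code points listed above.
        Console.WriteLine(EscapeCsvField("12\u201D wheels"));
    }
}
```

Because the string handed to the escaping step already contains only characters that survive the encoding, the later StreamWriter conversion should not introduce any new quotation marks.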

Using this answer, I quickly wrote something which builds a reverse mapping from each ISO-8859-15 (Latin-9) byte to the Unicode code points that are converted to it.

Encoding utf8 = Encoding.UTF8;
Encoding latin9 = Encoding.GetEncoding("ISO-8859-15");
// note: code page 1252 is Windows-1252; ISO-8859-1 itself is code page 28591
Encoding iso = Encoding.GetEncoding(1252);

// key: Latin-9 byte (hex), value: Unicode code points (hex) that are converted to it
var map = new Dictionary<string, List<string>>();

// same code to get each line from the file as per the linked answer

while (true)
{
    string line = reader.ReadLine();
    if (line == null) break;
    string codePointHexAsString = line.Substring(0, line.IndexOf(";"));
    int codePoint = Convert.ToInt32(codePointHexAsString, 16);

    // skip Unicode surrogate area
    if (codePoint >= 0xD800 && codePoint <= 0xDFFF)
        continue;

    string utf16String = char.ConvertFromUtf32(codePoint);
    byte[] utf8Bytes = utf8.GetBytes(utf16String);
    byte[] latin9Bytes = Encoding.Convert(utf8, latin9, utf8Bytes);
    string latin9String = latin9.GetString(latin9Bytes);
    byte[] isoBytes = Encoding.Convert(utf8, iso, utf8Bytes);
    string isoString = iso.GetString(isoBytes); // this is not always the same as latin9String!

    // first byte of the Latin-9 result, as hex
    string latin9HexAsString = latin9Bytes[0].ToString("X");

    if (!map.ContainsKey(latin9HexAsString))
    {
        map[latin9HexAsString] = new List<string>();
    }
    map[latin9HexAsString].Add(codePointHexAsString);
}

Interestingly, ISO-8859-15 seems to be replacing more characters than ISO-8859-1, which I didn't expect.
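For a check that does not depend on the Unicode data file, the same idea can be sketched as a brute-force scan over every code point. This is a self-contained variation, not the code from the linked answer: the class BestFitScan and the method FindCodePointsEncodedAs are hypothetical names, and the RegisterProvider call is assumed to be needed only on .NET Core / .NET 5+. The result depends on the framework's mapping tables, so it may differ between runtimes.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class BestFitScan
{
    // Returns every Unicode code point that the target encoding
    // converts to exactly one byte equal to `wanted`.
    public static List<int> FindCodePointsEncodedAs(Encoding target, byte wanted)
    {
        var result = new List<int>();
        for (int codePoint = 0; codePoint <= 0x10FFFF; codePoint++)
        {
            // skip the surrogate area, which holds no real characters
            if (codePoint >= 0xD800 && codePoint <= 0xDFFF)
                continue;

            byte[] bytes = target.GetBytes(char.ConvertFromUtf32(codePoint));
            if (bytes.Length == 1 && bytes[0] == wanted)
                result.Add(codePoint);
        }
        return result;
    }

    static void Main()
    {
        // Assumption: required on .NET Core / .NET 5+ before legacy code pages can be used.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding latin9 = Encoding.GetEncoding("ISO-8859-15");

        // 0x22 is the ISO-8859-15 byte for the quotation mark
        foreach (int codePoint in FindCodePointsEncodedAs(latin9, 0x22))
            Console.WriteLine("U+" + codePoint.ToString("X4"));
    }
}
```

Passing a different byte, or a different Encoding instance, gives the corresponding list for other characters and code pages.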

ardila
  • It would make sense to use ISO-8859-1 as a fallback for ISO-8859-15, since ISO-8859-15 is the same but the international currency symbol is replaced by the Euro (€) symbol. I'm waiting for your updated answer. What is codePoint in your code? – Krisztián Balla Dec 02 '15 at 17:36
  • Your list of characters doesn't include mine, right? I updated my question with your entries. – Krisztián Balla Dec 02 '15 at 17:42
  • @JennyO'Reilly that's not entirely correct. There are a few more differences. See https://en.wikipedia.org/wiki/ISO/IEC_8859-15#Differences_from_ISO-8859-1. In the linked question `codePoint` is an `int` representing the Unicode character code point. PS: my figuring out where in the innards of the framework this conversion is happening won't change the list of characters already given as answer to your question :) – ardila Dec 02 '15 at 18:13
  • Looks like our list is complete. I added `if (latin9String == "\"") { System.Console.WriteLine(codePoint.ToString("X")); }` to your loop and it printed the codes from my question / your answer. – Krisztián Balla Dec 03 '15 at 09:01
The .NET Framework uses best-fit mapping by default when converting from Unicode to legacy character encodings, such as ISO-8859-15. This is documented in the Windows Protocols Unicode Reference on MSDN. That document refers to a download called "Sorting Weight Tables" from the Microsoft Download Center, which includes the best-fit mappings for the legacy encodings supported by Windows (in the file "Windows Supported Code Page Data Files.zip", at the time of this writing).
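If the silent best-fit replacement is unwanted, the default can be overridden by requesting the encoding with an explicit encoder fallback. The sketch below uses the Encoding.GetEncoding overload that accepts fallback objects, together with EncoderExceptionFallback, so that characters without an exact ISO-8859-15 representation raise an error instead of being replaced. The class StrictLatin9 and the method CanEncodeExactly are hypothetical names, and the RegisterProvider call is assumed to be needed only on .NET Core / .NET 5+.

```csharp
using System;
using System.Text;

static class StrictLatin9
{
    // Returns true when the given (strict) encoding can represent every
    // character of the text without falling back to a substitute.
    public static bool CanEncodeExactly(Encoding strict, string text)
    {
        try
        {
            strict.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            // thrown by EncoderExceptionFallback for unmappable characters
            return false;
        }
    }

    static void Main()
    {
        // Assumption: required on .NET Core / .NET 5+ before legacy code pages can be used.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Explicit fallbacks replace the default best-fit behaviour.
        Encoding strict = Encoding.GetEncoding(
            "ISO-8859-15",
            new EncoderExceptionFallback(),
            new DecoderExceptionFallback());

        Console.WriteLine(CanEncodeExactly(strict, "plain \"quoted\" text"));
        Console.WriteLine(CanEncodeExactly(strict, "curly \u201Cquoted\u201D text"));
    }
}
```

An encoding created this way can also be passed to the StreamWriter constructor, in which case writing an unmappable character would be expected to throw rather than produce a substituted quotation mark. Using new EncoderReplacementFallback("?") instead would substitute a fixed character rather than throwing.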

Peter O.