
I'm parsing a number of text files that contain 99.9% ASCII characters: numbers, basic punctuation, and letters A-Z (upper and lower case).

The files also contain names, which occasionally include characters from the extended ASCII set, for example the umlaut Ü and the cedilla ç.

I want to work with standard ASCII only, so I handle these extended characters by running any names through a series of simple Replace() calls...

myString = myString.Replace("ç", "c");
myString = myString.Replace("Ü", "U");

This works with all the strange characters I want to replace except for Ø (capital O with a slash through it), which I think has the decimal equivalent of 157.

If I process the string character by character, using Convert.ToInt32() on each character, it claims the decimal equivalent is 65533 - well outside the normal range of extended ASCII codes.
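For reference, a minimal dump of the code points looks like this (with the suspect character written explicitly as U+FFFD, since that is the value being reported):

```csharp
using System;

class CodePointDump
{
    static void Main()
    {
        // The string as actually decoded: the 'ø' byte has already been
        // turned into U+FFFD, the Unicode replacement character.
        string myString = "J\uFFFDrgensen";
        foreach (char c in myString)
            Console.Write(Convert.ToInt32(c) + " ");
        Console.WriteLine();
        // The second value printed is 65533 - far outside extended ASCII.
    }
}
```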

Questions

  • Why doesn't myString.Replace("Ø", "O"); work on this character?
  • How can I replace "Ø" with "O"?

Other information - may be pertinent:

  • Opening the file with Notepad shows the character as a "Ø". Comparison with other sources indicates that the data is correct (i.e. the full string is "Jørgensen" - a valid Danish name).
  • Viewing the character in Visual Studio shows it as "�".
  • I'm getting exactly the same problem (with this one character) in hundreds of different files; I can happily replace all the other extended characters I encounter without problems.
  • I'm using System.IO.File.ReadAllLines() to read all the lines into an array of strings for processing.

ConanTheGerbil
  • You may want to search for code for generating “slug” which is what you seem to be doing – Alexei Levenkov Dec 27 '20 at 22:40
  • Converting the string to codepage [20127](https://learn.microsoft.com/en-us/windows/win32/intl/code-page-identifiers) which is US-ASCII will do those substitutions for you. – dxiv Dec 27 '20 at 23:26
  • I think you are using the wrong encoding. U+FFFD (decimal 65533) is the "replacement character" emitted when a decoder encounters an invalid sequence of bytes https://stackoverflow.com/a/3527176/2355006. Check what encoding Notepad is showing. – David Specht Dec 28 '20 at 02:15
  • You read the file with the wrong encoding (UTF-8). Read it with the correct one and normalize the string with [String.Normalize](https://learn.microsoft.com/en-us/dotnet/api/system.string.normalize?view=net-5.0) method. Check my answer for more details. – fenixil Dec 28 '20 at 02:50

1 Answer

  1. Replace works fine for 'Ø' (or 'ø') when the string actually contains it:

  Console.WriteLine("Jørgensen".Replace("ø", "o")); // prints Jorgensen

In your case the problem is that you are reading the data with the wrong encoding, which is why the string does not contain the character you are trying to replace. Ø is part of the extended ASCII set (iso-8859-1), but File.ReadAllLines tries to detect the encoding from BOM characters and, I suspect, falls back to UTF-8 in your case (see the Remarks section of the documentation).
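The effect is easy to reproduce at the byte level. Decoding the same iso-8859-1 bytes with both encodings (the byte values here are illustrative) shows exactly where U+FFFD comes from:

```csharp
using System;
using System.Text;

class EncodingDemo
{
    static void Main()
    {
        // 0xD8 is 'Ø' in iso-8859-1; as a lone byte it is not valid UTF-8.
        byte[] bytes = { 0x4A, 0xD8, 0x72 }; // "JØr" in iso-8859-1

        string latin1 = Encoding.GetEncoding("iso-8859-1").GetString(bytes);
        string utf8 = Encoding.UTF8.GetString(bytes);

        Console.WriteLine(latin1);       // JØr
        Console.WriteLine((int)utf8[1]); // 65533 - the UTF-8 decoder replaced
                                         // the invalid byte with U+FFFD
    }
}
```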

You see the same behavior in VS Code - it opens the file with UTF-8 encoding and shows you �: [screenshot: wrong encoding]. If you switch to the correct encoding, it shows the text correctly: [screenshot: correct encoding].

If you know what encoding is used for your files, just use it explicitly, here is an example to illustrate the difference:

            // requires: using System; using System.IO;
            //           using System.Linq; using System.Text;

            // prints J?rgensen - 'Ø' was already lost to U+FFFD during decoding,
            // so the Replace finds nothing to replace
            File.ReadAllLines("data.txt")
                .Select(l => l.Replace("Ø", "O"))
                .ToList()
                .ForEach(Console.WriteLine);

            // prints Jorgensen
            File.ReadAllLines("data.txt", Encoding.GetEncoding("iso-8859-1"))
                .Select(l => l.Replace("Ø", "O"))
                .ToList()
                .ForEach(Console.WriteLine);
  2. If you want to restrict yourself to the default ASCII set, you can convert all the special characters from the extended set to their base equivalents (which is ugly and non-trivial to do by hand). Searching online for this problem turns up String.Normalize() and this thread with several other suggestions:
        // requires: using System.Globalization; using System.Text;
        public static string RemoveDiacritics(string s)
        {
            // FormD splits accented letters into a base letter plus
            // combining marks, e.g. 'Ü' becomes 'U' + U+0308
            var normalizedString = s.Normalize(NormalizationForm.FormD);
            var stringBuilder = new StringBuilder();

            for (var i = 0; i < normalizedString.Length; i++)
            {
                var c = normalizedString[i];
                // keep only the base characters, dropping the combining marks
                if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                    stringBuilder.Append(c);
            }

            return stringBuilder.ToString();
        }
...
            // prints Jorgensen - note the explicit Replace for 'ø'/'Ø':
            // stroked letters have no canonical decomposition, so
            // RemoveDiacritics alone would leave them untouched
            File.ReadAllLines("data.txt", Encoding.GetEncoding("iso-8859-1"))
                .Select(l => l.Replace("ø", "o").Replace("Ø", "O"))
                .Select(RemoveDiacritics)
                .ToList()
                .ForEach(Console.WriteLine);
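A subtlety worth noting: FormD only separates combining marks. Accented letters like 'Ü' decompose into a base letter plus a combining mark, but stroked letters like 'Ø' are single code points with no canonical decomposition, so they survive normalization unchanged and still need an explicit Replace:

```csharp
using System;
using System.Text;

class DecompositionDemo
{
    static void Main()
    {
        // 'Ü' (U+00DC) canonically decomposes to 'U' + U+0308
        // (combining diaeresis), so FormD yields two chars.
        Console.WriteLine("Ü".Normalize(NormalizationForm.FormD).Length); // prints 2

        // 'Ø' (U+00D8) has no canonical decomposition - the stroke is
        // not a combining mark - so FormD leaves it as one code point.
        Console.WriteLine("Ø".Normalize(NormalizationForm.FormD).Length); // prints 1
    }
}
```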

I'd strongly recommend reading C# in Depth: Unicode by Jon Skeet and Programming with Unicode by Victor Stinner to get a much better understanding of what's going on :) Good luck.

PS. My code examples are functional and compact, but pretty inefficient; if you parse huge files, consider a different approach.

fenixil