
Why is my text being converted to the string of question marks shown here by the method below?

???????????? ???????????????? ???????????????????? ?????????????? ???????? ?????? ????????

I don't believe this happened before, but I just saw it happen. I am using .NET 4.8.

    public static string RemoveAccent(this string txt)
    {
        if (txt == null)
            return txt;

        byte[] bytes = Encoding.GetEncoding("Cyrillic").GetBytes(txt);
        return Encoding.ASCII.GetString(bytes);
    }
Mike Flynn
  • You are encoding `txt` to a byte format using Cyrillic encoding, and then trying to pretend that data is ASCII, even though it isn't. – ProgrammingLlama Oct 16 '20 at 13:32
  • I could have sworn this worked before and would remove accents from a string while keeping valid characters in place. https://stackoverflow.com/questions/10161598/how-can-i-replace-a-string-with-this-rules – Mike Flynn Oct 16 '20 at 13:33
  • `ASCII` is 7-bit US-ASCII. What are you trying to do? There's no need for such code in .NET; strings are Unicode and can handle any codepage. Are you trying to recover text that was mangled by another incorrect codepage conversion? – Panagiotis Kanavos Oct 16 '20 at 13:33
  • https://stackoverflow.com/questions/10161598/how-can-i-replace-a-string-with-this-rules – Mike Flynn Oct 16 '20 at 13:34
  • What accents are you trying to remove from what character set? – ProgrammingLlama Oct 16 '20 at 13:34
  • @MikeFlynn that thing would never, ever work. *No* Cyrillic character can be represented in US-ASCII. – Panagiotis Kanavos Oct 16 '20 at 13:34
  • I think [this is what you're looking for](https://stackoverflow.com/a/249126/3181933), assuming you're doing this for a language using the Latin alphabet. – ProgrammingLlama Oct 16 '20 at 13:35
  • OK, but I am telling you this has worked for years, and the question I linked has an accepted solution that uses it too. – Mike Flynn Oct 16 '20 at 13:35
  • @MikeFlynn you mistook mangling for working. Same as that bad answer's author. If you asked someone outside the US or UK they'd tell you this would never work; it would replace non-US characters with mangled characters or `?` (in the UK diacritics are used in names). The regex removes the mangled characters. – Panagiotis Kanavos Oct 16 '20 at 13:39
  • As for that particular text, it's obvious the characters are *not* the A-Z characters. Unicode has characters or surrogates that change how characters are displayed, using wider spacing etc. That's why all of those seemingly 7-bit characters weren't converted to US-ASCII. Try normalizing that string first. The *second* answer in the linked question is what you want – Panagiotis Kanavos Oct 16 '20 at 13:42
  • I understand, but this method is used to generate a URL slug, and we have never had an issue with it. We would have found the issue the first time we tested this, because it's used throughout the site over and over, but the issue only came up after upgrading the site. It seemed to work fine before, but now it behaves the way you say it should. – Mike Flynn Oct 16 '20 at 13:44
  • I did figure out that this string uses different characters, but you can't tell just by looking at it, other than that it's bolder. The issue must be the user copying and pasting text that was styled this way rather than typing it in, even though it looks normal. – Mike Flynn Oct 16 '20 at 13:57
  • The string's first character is `symbol: , code point: 1d5dc, position: 1`, so it is not ASCII but the Unicode character 'MATHEMATICAL SANS-SERIF BOLD CAPITAL I' (U+1D5DC). – Mike Flynn Oct 16 '20 at 14:00
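The code point check described in the last comments can be reproduced with a few lines of C#. This is only a minimal sketch: `CodePointInspector` and `Describe` are hypothetical names, not from the thread, and it assumes the input is well-formed UTF-16 (a lone surrogate would make `char.ConvertToUtf32` throw).

```csharp
using System;
using System.Collections.Generic;

public static class CodePointInspector
{
    // Hypothetical helper: lists each code point of a string as "U+XXXX".
    // Assumes well-formed UTF-16; a lone surrogate makes ConvertToUtf32 throw.
    public static List<string> Describe(string txt)
    {
        var result = new List<string>();
        for (int i = 0; i < txt.Length; i++)
        {
            // ConvertToUtf32 combines a surrogate pair into one code point.
            int codePoint = char.ConvertToUtf32(txt, i);
            result.Add("U+" + codePoint.ToString("X4"));

            // Skip the low surrogate that was just consumed.
            if (char.IsHighSurrogate(txt[i]))
                i++;
        }
        return result;
    }
}
```

For example, `CodePointInspector.Describe("\U0001D5DCd")` returns `U+1D5DC` followed by `U+0064`, which makes it visible that a letter that looks like a bold "I" is not the ASCII letter `I` (U+0049).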

1 Answer


The text contained non-ASCII Unicode characters (such as U+1D5DC), which is why the method behaved differently than it had with plain ASCII text. Adding the normalization below before the GetEncoding call fixed it.

    if (!txt.IsNormalized(NormalizationForm.FormKD))
    {
        txt = txt.Normalize(NormalizationForm.FormKD);
    }
Mike Flynn
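For comparison, here is a minimal sketch of the approach from the answer linked in the comments, which avoids the Cyrillic-to-ASCII round trip altogether. It is not the method used in the answer above. It normalizes to `FormKD` and then drops combining marks by Unicode category. The names `AccentStripper` and `RemoveAccentNormalized` are hypothetical, and the sketch assumes that any character left over after decomposition is handled by a later slug-cleaning step.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class AccentStripper
{
    // Hypothetical alternative to RemoveAccent: decompose, then drop combining marks.
    public static string RemoveAccentNormalized(this string txt)
    {
        if (txt == null)
            return null;

        // FormKD splits "é" into "e" + U+0301 and maps compatibility
        // characters such as U+1D5DC to their plain counterparts ("I").
        string decomposed = txt.Normalize(NormalizationForm.FormKD);

        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            // Keep everything except the combining accents produced by decomposition.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

With this sketch, `"\U0001D5DCd\u00E9e".RemoveAccentNormalized()` returns `"Idee"`. Characters with no decomposition (Cyrillic or CJK letters, for instance) pass through unchanged, so a slug generator would still need its own filtering afterwards.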