0

For the sake of the example, let's assume I am parsing some text written in German. This means that it contains symbols like ü or Ö. The problem is that all German-specific symbols get rendered as an empty square. Please take a look at this image:

Image http://img8.imageshack.us/img8/7502/93341046.png

Since I do not know whether this symbol is ü or Ö, I want to replace it with "." (dot). So the string from the image above should become "Osnabr.ck". How do I do that? Any help would be greatly appreciated!

Best Regards, Kiril

Kiril Stanoev
  • Can you please tell us what you actually want to achieve? You want to be sure that the text is 'Osnabrück', not 'OsnabrÖk' or anything else? – Stefan Steinegger Apr 21 '09 at 22:38

3 Answers

6

You can use a regular expression to replace any characters that you don't want. Just put the characters that you want in a negative set:

str = Regex.Replace(str, "[^0-9A-Za-z _]", ".");
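To make the one-liner above easier to try out, here is a minimal sketch that wraps it in a method. The class and method names (`Sanitizer`, `ReplaceUnlisted`) are made up for illustration; only `Regex.Replace` and the pattern come from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper class, not part of the original answer.
public static class Sanitizer
{
    // Replaces every character that is NOT a digit, an ASCII letter,
    // a space or an underscore with a dot.
    public static string ReplaceUnlisted(string input)
    {
        return Regex.Replace(input, "[^0-9A-Za-z _]", ".");
    }
}
```

With this, `Sanitizer.ReplaceUnlisted("Osnabr\u00FCck")` gives "Osnabr.ck", and so does an input containing the Unicode replacement character U+FFFD in the same position. Note that the whitelist is strict: punctuation such as "-" or "," is replaced as well, so you may want to extend the character set.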

You should look into what encoding you are using to decode the text. It looks like you are not using the same encoding that was used to encode the text, as the characters don't show up correctly.
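As a rough illustration of such an encoding mismatch, the sketch below encodes the word as ISO-8859-1 and decodes it again with two different encodings. The question does not say which encoding the real text uses, so ISO-8859-1 is purely an assumption here, and `EncodingDemo`, `SampleBytes` and `Decode` are hypothetical names.

```csharp
using System;
using System.Text;

// Hypothetical demo class; the encodings chosen are assumptions.
public static class EncodingDemo
{
    // Encodes "Osnabrück" as ISO-8859-1, where 'ü' is the single byte 0xFC.
    public static byte[] SampleBytes()
    {
        return Encoding.GetEncoding("iso-8859-1").GetBytes("Osnabr\u00FCck");
    }

    // Decodes raw bytes using the encoding with the given name.
    public static string Decode(byte[] bytes, string encodingName)
    {
        return Encoding.GetEncoding(encodingName).GetString(bytes);
    }
}
```

Decoding `SampleBytes()` with "iso-8859-1" round-trips to the original word. Decoding the same bytes with "utf-8" does not, because 0xFC is not valid UTF-8 there; the decoder substitutes a placeholder character, which many fonts draw as a box or a question mark.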

Guffa
  • Having a dot for every character that is not a number or a character A-Z is as good as having a square! This is NOT a solution! – Stefan Steinegger Apr 20 '09 at 08:53
  • Well, that's the solution that the OP asked for. Offering an actual answer to the question often makes it easier to suggest a better alternative than to begin with pointing out flaws in the question... – Guffa Apr 20 '09 at 09:24
  • No, he asked to get rid of the square because he doesn't know if the square is a 'ä' or 'ö'. So turning all the squares to dots is just useless. He wants to turn exactly one of the characters to a dot. – Stefan Steinegger Apr 20 '09 at 20:27
  • He wants to know that it is "Osnabrück" AND NOT "OsnabrÖck". So he only wants to replace the "ü" AND NOT ANYTHING ELSE. This regex does replace ALL non-printable characters, and he STILL doesn't know if it is actually an ü or not! – Stefan Steinegger Apr 20 '09 at 23:01
  • Please read the question again... He wants the same result regardless of which character it should have been. – Guffa Apr 21 '09 at 01:56
  • No. He wants to know what the character is. He already HAS the same character for each, it's an "empty square". – Stefan Steinegger Apr 21 '09 at 06:43
  • How do you know that? It doesn't say so anywhere in the question. – Guffa Apr 21 '09 at 09:23
  • Quote: "Since I do not know whether this symbol is ü or Ö ..." Isn't it obvious? He doesn't know what the character is behind the empty square and tries to find it out in the immediate window... This answer doesn't solve the problem. – Stefan Steinegger Apr 21 '09 at 14:47
  • Yes, it's obvious that he doesn't know which character it is, but it doesn't say anywhere that he wants to know which character it is. What it does say is that the square characters are the problem, and that the string should become "Osnabr.ck". This answer does solve that problem. – Guffa Apr 21 '09 at 16:44
  • What's the big deal, just change the 3rd parameter from a "." to " " problem solved... one less argument, lol. – Anonymous Type Jul 13 '10 at 00:13
2

If you want to see the actual characters (and I notice you are displaying the value in the Immediate Window in Visual Studio), you need to use a font that can display the characters. The presence of the square means the font you are using does not contain glyphs that match those characters. You can change the font used in various parts of Visual Studio in the options dialog.

Some more detail in this question here.

1800 INFORMATION
  • Not just the font, I believe he needs to change the default text encoding. – dirkgently Apr 18 '09 at 09:52
  • Every font on Windows has at least Basic Latin support, that means umlauts like ä or ü should display fine. This is definitely an encoding issue. – Joey Apr 18 '09 at 09:53
  • I doubt it. The box simply means that the font does not supply glyphs for that character. If it is an encoding issue, this will tend to be displayed as question marks or garbage characters. Please read the link I provided – 1800 INFORMATION Apr 18 '09 at 09:55
  • The box is what's shown when the character is not in the font's character set, and the most likely reason for this is using the wrong encoding so that the character is decoded into a different character. – Guffa Apr 18 '09 at 13:22
  • I don't really want to get into any kind of argument about this, but I would say the most likely reason you would find the character is not in the font is because you used the wrong font – 1800 INFORMATION Apr 19 '09 at 01:45
  • As it's the Immediate Window in Visual Studio it's very likely that it's the default font, which supports Unicode. As all strings in .NET are Unicode, it's much more likely that the string is incorrectly decoded than that a non-Unicode font is used. – Guffa Apr 20 '09 at 00:44
-1

There is a Replace method on the string class. It's easiest to replace a single character with something else:

InnerText.Replace("ü", ".");

You can change several characters at the same time by chaining Replace:

InnerText.Replace("ü", "[ue]").Replace("Ö", "[Oe]");
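For use outside the Immediate Window, keep in mind that `string.Replace` returns a new string and leaves the original untouched, so the result has to be captured. A minimal sketch, with the hypothetical names `Transliterator` and `MarkUmlauts`:

```csharp
using System;

// Hypothetical helper class, not part of the original answer.
public static class Transliterator
{
    // Marks 'ü' (U+00FC) and 'Ö' (U+00D6) with visible placeholders.
    // Each Replace call returns a new string, so the calls can be chained.
    public static string MarkUmlauts(string input)
    {
        return input.Replace("\u00FC", "[ue]").Replace("\u00D6", "[Oe]");
    }
}
```

`Transliterator.MarkUmlauts("Osnabr\u00FCck")` gives "Osnabr[ue]ck". If the string does not actually contain U+00FC, for example because it was decoded with the wrong encoding and holds some other character instead, it comes back unchanged.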
Stefan Steinegger
  • Regex should be (a) more maintainable and (b) more efficient in this case. I suspect Ö and ü are merely examples in this case. Would you really want to chain a few hundred thousand Replace calls; one for every character you want to disallow? – Joey Apr 18 '09 at 12:28
  • Obviously a Replace is just as risky, as you managed to use a comma as replacement string in the second call... ;) Also replacing those characters doesn't work at all, as the characters aren't actually "Ö" or "ü". If they were, they would show up correctly. – Guffa Apr 20 '09 at 21:52
  • You didn't understand the problem. He wants to distinguish the ü from all other characters that are not displayable and appear as a square. Ö and ü ARE NOT examples, actually only the ü was required. Didn't you see that he is in an Immediate Window? – Stefan Steinegger Apr 20 '09 at 22:09
  • The comma was not an accident, it's all about distinguishing the characters that can't be displayed... It's only code for the Immediate Window... Don't you all read the question? – Stefan Steinegger Apr 20 '09 at 22:14
  • I don't know where you get this deeper understanding of the question that isn't written in the question... It doesn't say anything about distinguishing the characters anywhere in the question. – Guffa Apr 21 '09 at 02:01
  • Quote: "Since I do not know whether this symbol is ü or Ö ..." – Stefan Steinegger Apr 21 '09 at 06:43
  • That's half a sentence. Perhaps you should read the rest of the sentence also, and you see that it doesn't say anything about wanting to distinguish between the characters. – Guffa Apr 21 '09 at 16:48
  • No it doesn't say, because it is obvious ... I give up. If you don't want to see, you won't. – Stefan Steinegger Apr 21 '09 at 20:21
  • It's not there, so it's impossible to see. You are reading between the lines and guessing, so you see what you want to see... – Guffa Apr 21 '09 at 22:02
  • Yes, sure, he wants to replace all empty squares with dots because he does not know if it is a 'ü' or something else. This REALLY makes sense. – Stefan Steinegger Apr 21 '09 at 22:36