-1

I have text like this:

"\ud83d \udc63 \ud83c \udf3f \ud83d \udca6 DE BOUCAN LALEU etc....Sur sa cr\u00eate se dressent"

How can I convert it to normal text in C#? It should look like this:

👣 🌿 💦 DE BOUCAN LALEU etc....Sur sa crête se dressent

I got this text from a script in an HTML document. It already looked like this there; I am not generating it myself, I just extract it with a regex match, and I would like to save it to MongoDB.

leo
  • Does this answer your question? [Converting from hex to string](https://stackoverflow.com/questions/724862/converting-from-hex-to-string) –  Oct 05 '20 at 09:33
  • What's the mapping between those codes and the icons? – Liam Oct 05 '20 at 09:35
  • 2
    Where did this text come from? Those are Unicode escape sequences, not hex codes. .NET doesn't need them, as this page proves: SO is a .NET application, which means its strings are already Unicode. It stores strings in `nvarchar` (Unicode) fields in SQL Server. You could just copy ` DE BOUCAN LALEU etc....Sur sa crête se dressent` into a text field as-is, as I just did. – Panagiotis Kanavos Oct 05 '20 at 09:40
  • Do those escape sequences exist though? Or is this how the debugger's watch window displays some characters? If the escape sequences really exist in the string, the producer of the string has a serious bug, emitting escape sequences instead of proper UTF8 text – Panagiotis Kanavos Oct 05 '20 at 09:41
  • Long story short, you don't need to convert Unicode text in .NET. The string's producer has a bug – Panagiotis Kanavos Oct 05 '20 at 09:42
  • 2
    What tool are you using to display the text? The tool is the issue, not the text inside the file. – jdweng Oct 05 '20 at 09:48
  • @jdweng the text is actually the issue, not what's displaying it. Unicode escape sequences shouldn't have a space between them. So instead of `"\ud83d \udc63"` it should be `"\ud83d\udc63"` as [this](https://dencode.com/en/string/unicode-escape?v=%5Cud83d%5Cudc63%0A%0A%5Cud83d%20%5Cudc63&nl=crlf) DenCode example shows – MindSwipe Oct 05 '20 at 11:54
  • @MindSwipe: The spaces have nothing to do with the issue. They will just put spaces between the characters. – jdweng Oct 05 '20 at 11:58
  • @jdweng You are mistaken. Those characters are surrogate pairs, and it will not work if there's a space between them because they would no longer form a surrogate pair. You can demonstrate this simply by displaying the string in a message box with and without the spaces. – Matthew Watson Oct 07 '20 at 12:22
  • @Matthew Watson: Unicode has a combination of one-byte and two-byte characters. They are not always pairs. So Unicode has more than one type of space. – jdweng Oct 07 '20 at 12:49

3 Answers

2

You should not have spaces between the two Unicode escape sequences that make up a surrogate pair.

Your string should look like this:

"\ud83d\udc63 \ud83c\udf3f \ud83d\udca6 DE BOUCAN LALEU etc....Sur sa cr\u00eate se dressent";

You can test this in a WinForms app using MessageBox.Show():

MessageBox.Show("\ud83d\udc63 \ud83c\udf3f \ud83d\udca6 DE BOUCAN LALEU etc....Sur sa cr\u00eate se dressent");

Note that the default font for the console doesn't support those Unicode characters, so Console.WriteLine() will display box characters for the unsupported Unicode characters.

Also note that normal WinForms controls don't support colour emojis, so those special characters are going to be displayed in black and white.
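As a quick check that does not depend on fonts or on how a control renders emoji, the pairing can be inspected directly. This is only a minimal sketch: the variable names `joined` and `split` are made up for illustration, and it uses the first emoji from the string above.

```csharp
using System;

// The same two UTF-16 code units, with and without a space between them.
string joined = "\ud83d\udc63";   // high surrogate immediately followed by low surrogate
string split = "\ud83d \udc63";   // high surrogate, space, low surrogate

// char.IsSurrogatePair(string, int) tests the code units at index and index + 1.
Console.WriteLine(char.IsSurrogatePair(joined, 0)); // True
Console.WriteLine(char.IsSurrogatePair(split, 0));  // False

// A valid pair decodes to a single code point.
Console.WriteLine(char.ConvertToUtf32(joined, 0).ToString("X")); // 1F463
```

With the space in place, the string holds two lone surrogates and a space rather than one character, which is why it cannot be displayed as an emoji.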

Matthew Watson
0

Have you tried searching for "escape sequences in strings C#"? This article can also help you understand strings better: https://csharpindepth.com/articles/Strings

The article says: \uxxxx - Unicode escape sequence for character with hex value xxxx. So basically I don't think you have to do anything to change the string. It is well crafted already. If you are running a console application, try displaying it with:

Console.WriteLine(thisString); // or whatever your string is named

or, if you are in a Windows Forms application, try:

MessageBox.Show(thisString); // where thisString is your string

You will see it is the same, because the Unicode escape sequences for those fancy pictures are already in the string. Try it.

Soliman Soliman
0

I have fixed it myself by using the following code:

System.Text.RegularExpressions.Regex.Unescape("\ud83d\udc63\ud83c\udf3f\ud83d\udca6 DE BOUCAN LALEU etc....Sur sa cr\u00eate se dressent");
leo
  • 1
    No, you fixed it by removing the spaces from between the surrogate pairs. Your call to `System.Text.RegularExpressions.Regex.Unescape()` actually does nothing whatsoever - if you compare the string you pass in to the string you get back, you'll find it's identical! – Matthew Watson Oct 07 '20 at 12:27
  • 1
    See this .Net Fiddle example: https://dotnetfiddle.net/aCx9nr – Matthew Watson Oct 07 '20 at 12:45
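
The difference between the two situations discussed in this thread can be sketched as follows. This is an illustration under assumptions, not the asker's actual code: the verbatim literal `scraped` stands in for text extracted from the page, where the backslashes are real characters, and `JoinSplitSurrogates` is a hypothetical helper that assumes the halves are separated by exactly one space.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: rejoin a high and a low surrogate separated by one space.
static string JoinSplitSurrogates(string s) =>
    Regex.Replace(s, @"([\uD800-\uDBFF]) ([\uDC00-\uDFFF])", "$1$2");

// Case 1: an ordinary C# literal. The compiler has already decoded the escapes,
// so the string contains no backslashes and Regex.Unescape has nothing to do.
string compiled = "\ud83d\udc63 \ud83c\udf3f \ud83d\udca6 DE BOUCAN LALEU etc....Sur sa cr\u00eate se dressent";
Console.WriteLine(Regex.Unescape(compiled) == compiled); // True

// Case 2: a verbatim literal, standing in for scraped text that really
// contains backslash-u sequences (assumption about the asker's input).
string scraped = @"\ud83d \udc63 \ud83c \udf3f \ud83d \udca6 DE BOUCAN LALEU etc....Sur sa cr\u00eate se dressent";
string decoded = Regex.Unescape(scraped);        // escapes become characters, halves still split
string repaired = JoinSplitSurrogates(decoded);  // spaces inside each pair removed
Console.WriteLine(repaired == compiled); // True
```

Note that `Regex.Unescape` also interprets other backslash sequences, so this sketch assumes the input contains only `\u` escapes. If the extracted text is actually a JSON string, a JSON parser is probably the more robust way to decode it.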