
I am trying to parse a string and remove the emojis from it while keeping the new lines.

So, I have this piece of code:

string text = "S H A V A . Est 2015\nBandung\nLine: @ubm5921j\nBbm: 7D2E6310\nFAST ORDER\ud83d\udc47\ud83c\udffe\ud83d\udc47\ud83c\udffe";
MessageBox.Show(text);
string result = Regex.Replace(text, @"\p{Cs}", "");

The output of 'text' here is the following:

[Screenshot: message box output for text]

As you can see, the new lines work fine and the emojis appear at the end. The next line removes them perfectly, so the result string contains the same text with the new lines intact and no emojis.

On another part of the program I have this code.

//uu.description is the same string as above 'text', 
//this is where I scrape directly from html
string text2 = uu.description; 
MessageBox.Show(text2);
string result2 = Regex.Replace(text2, @"\p{Cs}", "");

[Screenshot: message box output for text2]

As you can see, in this case text2 is displayed exactly as it is written, and the regex does absolutely nothing: the new lines don't work and the emojis are not removed.

I am very confused about why it works in the first case but not in the second. I've been on this for hours and can't figure it out.

Pilgerstorfer Franz
user5204184
  • Try `string text2 = Regex.Unescape(uu.description);` and replace the characters with `@"\p{Cs}"`. Or check the scraping code: you get all the characters escaped at some point. Please show the HTML scraping code. – Wiktor Stribiżew Aug 12 '15 at 13:08
  • `string text2 = HttpUtility.HtmlDecode(uu.description);` – Khanh TO Aug 12 '15 at 13:10
  • `string text2 = WebUtility.HtmlDecode(uu.description);` if you use .NET 4.0 and above – Khanh TO Aug 12 '15 at 13:12
  • @stribizhev The scraping code is kind of too long to show. Basically, I have a web client and I use requestString() to download the page and then I scrape it off there. I tried `string lmao = Regex.Unescape(uu.description);` and then `lmao = Regex.Replace(testz, @"\p{Cs}", "");` but same result – user5204184 Aug 12 '15 at 13:14
  • @KhanhTO Thanks for the answer. My target framework in the properties shows .NET Framework 4 Client Profile, but I get the error `The name 'HttpUtility' does not exist in the current context`. – user5204184 Aug 12 '15 at 13:15
  • Try adding a reference to `System.Web` and importing the namespace with `using` – Khanh TO Aug 12 '15 at 13:16
  • @KhanhTO I went to References > .NET, but I can't find a `System.Web` reference. There's only `System.Web.ApplicationServices` and `System.Web.Services`. – user5204184 Aug 12 '15 at 13:17
  • Perhaps, you need to set the Encoding to UTF8: `webClient.Encoding = Encoding.UTF8;` – Wiktor Stribiżew Aug 12 '15 at 13:18
  • it could be that you already reference it. Try importing the namespace – Khanh TO Aug 12 '15 at 13:19
  • @stribizhev I just tried that and it didn't seem to work. – user5204184 Aug 12 '15 at 13:20
  • @KhanhTO Alright, I just imported it. My code is `string text2 = WebUtility.HtmlDecode(uu.description);` `string result2 = Regex.Replace(text2, @"\p{Cs}", "");` but still it doesn't work. Both text2 and result2 show the same text with no new lines happening. – user5204184 Aug 12 '15 at 13:23
  • Are you sure they are? Since the final results are different, there is a difference. Please hover over the `uu.Message` in the IDE and check what the string looks like. Make a screenshot, if possible. – Wiktor Stribiżew Aug 12 '15 at 14:01
  • @stribizhev Can you explain to me how to do this? When I hover over uu.description in the IDE, it just shows `ClassName.description`. – user5204184 Aug 12 '15 at 14:27
  • If you work in Visual Studio, when debugging, hover the cursor over the code, right on the `Description`. – Wiktor Stribiżew Aug 12 '15 at 14:41
  • Change .NET Framework 4 Client Profile to .NET Framework 4. The Client Profile is a smaller, limited .NET library and will cause issues such as certain parts of the library being missing. – raddevus Aug 12 '15 at 14:46
  • Your `MessageBox.Show` calls are showing us that `text` is *not* the same as `text2`. It isn't the regex's fault. – Rawling Aug 12 '15 at 14:48
  • @Rawling Yes, I do not blame Regex. I am confused why text and text2 are different, with the same text. The new lines are not working in the second case for some reason. – user5204184 Aug 12 '15 at 14:51
  • In that case, you need to show us how you're roundtripping `uu.description`. – Rawling Aug 12 '15 at 14:55
  • @Rawling Can you explain how to do that? Thanks. – user5204184 Aug 12 '15 at 14:58
  • Show us where `uu` comes from... – Rawling Aug 12 '15 at 15:02
  • @Rawling uu comes from a class I have written. As I have previously written a comment, the code is kinda too long, so I'll give you the steps I use. I have a WebClient and I use the `requestString()` function to download the source. Then I parse it. I'll give you the exact URL I use for this post: https://instagram.com/shavahouse/. My program scrapes the bio in the profile `"biography":"S H A V A . Est 2015\nBandung\nLine: @ubm5921j\nBbm: 7D2E6310\nFAST ORDER\ud83d\udc47\ud83c\udffe\ud83d\udc47\ud83c\udffe"` this particular thing. If you need the full source, I can upload it on pastebin. – user5204184 Aug 12 '15 at 15:05
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/86804/discussion-between-user5204184-and-rawling). – user5204184 Aug 12 '15 at 15:39
  • Please post the code you are using to download site contents. That is where we should start looking for the culprit. I have just tried, and - with UTF8 as Webclient encoding - got `FAST ORDER👇🏾👇🏾"` in the response. – Wiktor Stribiżew Aug 12 '15 at 22:28
  • @stribizhev I have fixed the problem, by using Regex.Unescape() as you suggested above. Posted my solution in the answers. Thanks a lot :) – user5204184 Aug 13 '15 at 06:29
  • @user5204184: Actually, that was exactly my suggested solution. I think I should have posted the answer. – Wiktor Stribiżew Aug 13 '15 at 06:31
  • @stribizhev That's correct. I have written that in the answer itself. If you would like to post it as an answer, I will delete mine. – user5204184 Aug 13 '15 at 06:31
  • You already got credits, let it be. Just next time please let us know if the suggested solution works for you and give credit to those who earned it. – Wiktor Stribiżew Aug 13 '15 at 06:32
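
One way to check whether two strings that look alike really hold the same characters, as the comments above ask, is to print their UTF-16 code units. The following is a minimal sketch; `StringInspector` and `DumpCodeUnits` are hypothetical names used for illustration and are not part of the original code.

```csharp
using System;
using System.Linq;

public static class StringInspector
{
    // Hypothetical helper: lists each UTF-16 code unit as four hex digits.
    public static string DumpCodeUnits(string s) =>
        string.Join(" ", s.Select(c => ((int)c).ToString("X4")));

    public static void Main()
    {
        // A regular literal: the compiler turns \n into one line feed (000A).
        Console.WriteLine(DumpCodeUnits("R\n"));   // 0052 000A

        // A verbatim literal: the backslash and the 'n' stay two characters.
        Console.WriteLine(DumpCodeUnits(@"R\n"));  // 0052 005C 006E
    }
}
```

If the scraped string shows `005C 006E` where the hard-coded one shows `000A`, the scraped text still contains escape sequences as plain characters.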

2 Answers


I have fixed it. My fixed code looks like this:

string text2 = uu.description;
string result2 = Regex.Replace(Regex.Unescape(text2), @"\p{Cs}", "");

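For some reason, the parsed string contained literal escape sequences: a backslash followed by `n` (which looks like `\\n` when written as a C# string literal) instead of an actual newline character, and likewise for the `\ud83d`-style escapes. `Regex.Unescape` turns them into the real characters, so `\p{Cs}` can then match the surrogates. I would like to thank @stribizhev for his idea! Thank you.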
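
A minimal, self-contained sketch of the fix, assuming the scraped text holds the escape sequences as plain characters. `EmojiStripper` and `StripEmoji` are hypothetical names, and the verbatim string only stands in for `uu.description`.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmojiStripper
{
    // Hypothetical helper: decodes literal escape sequences, then removes
    // every UTF-16 surrogate code unit.
    public static string StripEmoji(string scraped)
    {
        // '\' + 'n' becomes a real line feed; "\ud83d" becomes U+D83D.
        string decoded = Regex.Unescape(scraped);

        // \p{Cs} matches surrogate code units. Emoji outside the Basic
        // Multilingual Plane are stored as surrogate pairs.
        return Regex.Replace(decoded, @"\p{Cs}", "");
    }

    public static void Main()
    {
        // Verbatim string: the backslashes are kept, as in the page source.
        string scraped = @"FAST ORDER\ud83d\udc47\ud83c\udffe\nBandung";

        Console.WriteLine(scraped.Contains("\n"));            // False
        string cleaned = StripEmoji(scraped);
        Console.WriteLine(cleaned == "FAST ORDER\nBandung");  // True
    }
}
```

`Regex.Unescape` is designed for regular-expression escapes, so this relies on the scraped text using only escapes it understands. Since the biography appears inside what looks like a JSON value, decoding it with a JSON parser may be a more robust alternative.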

user5204184

Try this

string text = "S H A V A . Est 2015\nBandung\nLine: @ubm5921j\nBbm: 7D2E6310\nFAST ORDER\ud83d\udc47\ud83c\udffe\ud83d\udc47\ud83c\udffe";
string output = string.Join("", text
    .Select(x => Encoding.Unicode.GetBytes(new char[] { x }))
    .Select(y => (y[1] << 8) + y[0])
    .Where(y => y < 256) // keep only code units below 256
    .Select(z => ((char)z).ToString()));

Output from code

S H A V A . Est 2015
Bandung
Line: @ubm5921j
Bbm: 7D2E6310
FAST ORDER
jdweng
  • What is this supposed to do? It gives the same output as text. – user5204184 Aug 12 '15 at 14:54
  • I converted the string characters to int[] so I can test whether a character is < 256 (ASCII and Latin-1) or >= 256 (the rest of Unicode), and removed all characters >= 256. Then I converted back to a string. – jdweng Aug 12 '15 at 17:52
  • I see. My goal is to keep the new lines and remove the emoji only. So \n has to stay in the string. And also, I think the problem is that the new lines are not working, they appear as normal letters, rather than new lines. – user5204184 Aug 12 '15 at 17:54
  • I'm not changing '\n' in my code. The new lines are working in Windows. You can verify that '\n' is working by pasting the string into Notepad. I added the actual results to my answer. – jdweng Aug 12 '15 at 19:44
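
To compare the two approaches discussed in this thread, here is a small sketch. `KeepBelow256` is a hypothetical simplification intended to have the same effect as the byte-based filter above, and the sample string is made up for illustration. It shows that the numeric filter also drops non-emoji characters such as `ż`, while `\p{Cs}` removes only surrogates.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class FilterComparison
{
    // Hypothetical simplification: keep only UTF-16 code units below 256
    // (ASCII plus Latin-1).
    public static string KeepBelow256(string s) =>
        new string(s.Where(c => c < 256).ToArray());

    // The regex approach from the question: remove surrogate code units only.
    public static string RemoveSurrogates(string s) =>
        Regex.Replace(s, @"\p{Cs}", "");

    public static void Main()
    {
        // 'ż' is U+017C; the emoji is written as a surrogate pair.
        string sample = "Stribi\u017Cew\n\ud83d\udc47";

        Console.WriteLine(KeepBelow256(sample) == "Stribiew\n");            // True
        Console.WriteLine(RemoveSurrogates(sample) == "Stribi\u017Cew\n");  // True
    }
}
```

Both filters leave `\n` untouched, but neither helps while the input still contains the escape sequences as literal text.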