Removing part of string starting with \ud

Question

I am trying to remove anything starting with \ud from a string.

My text:
onceuponadollhouse: "Iconic apart and better together \ud83d\udc6fâ€â™€ï¸The  CaboodlesÂ® x Barbieâ„¢ collection has us thinking about our Doll Code \ud83c\udf80 We stand for one another by sharing our lessons

The answer I am looking for:
onceuponadollhouse: "Iconic apart and better together â€â™€ï¸The  CaboodlesÂ® x Barbieâ„¢ collection has us thinking about our Doll Code We stand for one another by sharing our lessons

Somewhat relevant (but not a solution) https://stackoverflow.com/a/54549164/5320906 — snakecharmerb, Jun 06 '21 at 17:45

score 2 · Answer 1 · answered Jun 06 '21 at 09:18

The ideal approach is to take a step back, work out where in the process the encoding is getting mangled, and fix it there. Your text has two separate problems: (a) surrogate pairs, which are the pairs of characters starting with \ud; and (b) UTF-8 interpreted as Latin-1 or some similar encoding, which produces sequences like the â„¢ after "Barbie".

Making sure that your input text is interpreted correctly matters because stripping loses information: here you're losing the emojis "woman with bunny ears" and "ribbon"; another time it might be somebody's name or some other important piece of information.
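If the \ud escapes are real surrogate code points in a Python str, and the â„¢-style sequences are UTF-8 bytes that were decoded as Windows-1252, both can sometimes be repaired rather than stripped. The following is a minimal sketch under those two assumptions, not a diagnosis of where your data actually went wrong; the function names join_surrogates and undo_cp1252_mojibake are made up for illustration.

```python
def join_surrogates(s: str) -> str:
    """Combine UTF-16 surrogate pairs into the characters they encode."""
    # 'surrogatepass' lets surrogate code points through the encoder;
    # decoding the resulting UTF-16 bytes pairs them up into real characters.
    return s.encode('utf-16', 'surrogatepass').decode('utf-16')


def undo_cp1252_mojibake(s: str) -> str:
    """Reverse UTF-8 text that was mistakenly decoded as Windows-1252."""
    # Assumption: every character in s exists in cp1252; anything else
    # raises UnicodeEncodeError.
    return s.encode('cp1252').decode('utf-8')


emoji = join_surrogates('together \ud83d\udc6f')
# '\u00e2\u201e\u00a2' is the three-character mojibake that follows "Barbie".
trademark = undo_cp1252_mojibake('Barbie\u00e2\u201e\u00a2')

# ascii() keeps the output printable on consoles that cannot show emoji.
print(ascii(emoji))
print(ascii(trademark))
```

Neither function is safe to run blindly over the whole example string: a lone (unpaired) surrogate makes the UTF-16 decode fail, and mojibake that has already lost characters will not round-trip. Applying the repair at the point where the text is first read is likely to be more reliable.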

If you're in a situation where you can't do it properly, and you need to strip the surrogate pairs, you can use re.sub:

import re

text = 'onceuponadollhouse: "Iconic apart and better together \ud83d\udc6fâ€â™€ï¸The  CaboodlesÂ® x Barbieâ„¢ collection has us thinking about our Doll Code \ud83c\udf80 We stand for one another by sharing our lessons'

# Remove every run of surrogate code points (U+D800 to U+DFFF)
stripped = re.sub('[\ud800-\udfff]+', '', text)

print(stripped)

Depending on your purpose, it might be useful to replace those characters with a placeholder; since they always come in pairs, you might do something like this:

import re

text = 'onceuponadollhouse: "Iconic apart and better together \ud83d\udc6fâ€â™€ï¸The  CaboodlesÂ® x Barbieâ„¢ collection has us thinking about our Doll Code \ud83c\udf80 We stand for one another by sharing our lessons'

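# Replace each pair of surrogates with a placeholder; a lone (unpaired) surrogate is left untouched
stripped = re.sub('[\ud800-\udfff]{2}', '<unknown character>', text)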

print(stripped)
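Both snippets above assume the string holds actual surrogate code points. If the text instead contains the literal six-character sequences (a backslash, "u", and four hex digits), for example because it is escaped text that was never decoded, the character class will not match anything. A sketch for that case, assuming the escapes look exactly like the ones in the question; ESCAPED_SURROGATE and strip_escaped_surrogates are hypothetical names:

```python
import re

# Matches a literal backslash + "u" + a hex value in the surrogate
# range D800-DFFF (first digit after "d" is 8-f).
ESCAPED_SURROGATE = re.compile(r'\\ud[89a-f][0-9a-f]{2}', re.IGNORECASE)


def strip_escaped_surrogates(s: str) -> str:
    """Remove literal backslash-u surrogate escapes (not real code points)."""
    return ESCAPED_SURROGATE.sub('', s)


# Raw string: the backslashes here are ordinary characters, not escapes.
text = r'better together \ud83d\udc6f and our Doll Code \ud83c\udf80 We stand'

print(strip_escaped_surrogates(text))
```

If the escaped text is really a JSON string, decoding it with json.loads would probably be the cleaner route, since that turns each escaped pair into the emoji it represents instead of discarding it.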

score 0 · Answer 2 · answered Jun 06 '21 at 00:03

Check out the emot Python package. I discovered it this morning from this article: https://towardsdatascience.com/5-python-libraries-that-you-dont-know-but-you-should-fd6f810773a7

The examples given in the documentation only interpret emojis, but the package also gives their location, so it wouldn't be too much of a stretch to replace them.
