Is there a way to identify and convert nonescaped four-digit Unicode characters within a string of normal characters?

Question

I'm using requests.get to retrieve data from Google Ngrams.

I'm having a problem where, when I query the website for a string with an accent character in it (in this case I'm searching "marcher d'un pas lourd"), it returns information for "marcher d&#39; un pas lourd".

As you can see in the returned string, the apostrophe has been replaced with the four-digit Unicode for an apostrophe.

This messes up the rest of my code, as I use my original string query ("marcher d'un pas lourd") to find the data I need from the returned data.

Is there any function or program that will search and convert four-digit Unicode in a string of otherwise normal characters? Note that I DO NOT want to remove these special characters, but rather get them to their correct representation within my code.

score 1 · Accepted Answer · answered Oct 28 '19 at 06:39

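Those are called HTML entities (here, a numeric character reference: &#39; is the apostrophe), and they can be unescaped with html.unescape from the standard library: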

>>> s="marcher d&#39; un pas lourd"
>>> import html
>>> html.unescape(s)
"marcher d' un pas lourd"
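If several strings come back from the response, the same call can be applied to each one. The following is only a minimal sketch: the helper name unescape_labels and the sample list raw_labels are made up for illustration and stand in for whatever strings you extract from the returned data.

```python
import html


def unescape_labels(labels):
    """Return a new list with HTML entities in each label converted to characters."""
    return [html.unescape(label) for label in labels]


# Hypothetical sample data standing in for strings pulled out of the response.
raw_labels = ["marcher d&#39; un pas lourd", "caf&eacute; &amp; th&#233;"]
clean_labels = unescape_labels(raw_labels)

for label in clean_labels:
    print(label)
```

html.unescape handles both numeric references (&#39;, &#233;) and named entities (&eacute;, &amp;). Note that the space after the apostrophe is part of the returned string and is not touched by unescaping, so a direct comparison with the original query "marcher d'un pas lourd" may still need to account for it.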

Mark Tolonen
