
I'm using requests.get to retrieve data from Google Ngrams.

I'm having a problem where, when I query the website for a string containing an apostrophe (in this case I'm searching "marcher d'un pas lourd"), it returns information for "marcher d&#39; un pas lourd".

As you can see in the returned string, the apostrophe has been replaced with the four-digit Unicode for an apostrophe.

This messes up the rest of my code, as I use my original string query ("marcher d'un pas lourd") to find the data I need from the returned data.

Is there any function or program that will search and convert four-digit Unicode in a string of otherwise normal characters? Note that I DO NOT want to remove these special characters, but rather get them to their correct representation within my code.

Mark Tolonen
boseHere

1 Answer


Those are called HTML entities (here, a numeric character reference), and they can be unescaped with html.unescape from the standard library:

>>> s = "marcher d&#39; un pas lourd"
>>> import html
>>> html.unescape(s)
"marcher d' un pas lourd"
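Note that the decoded string still has a space after the apostrophe ("d' un"), so it will not match the original query exactly. A minimal sketch of one way to handle that, assuming the extra space is only a tokenization artifact of the returned data; the helper name `normalize_ngram_label` is hypothetical and not part of any library:

```python
import html


def normalize_ngram_label(raw: str) -> str:
    """Decode HTML entities, then rejoin an apostrophe with the token after it."""
    decoded = html.unescape(raw)
    # Assumption: a space right after an apostrophe was added by tokenization.
    # This would also join a closing quote to the next word, so adjust as needed.
    return decoded.replace("' ", "'")


query = "marcher d'un pas lourd"
returned = "marcher d&#39; un pas lourd"

print(html.unescape(returned))
print(normalize_ngram_label(returned) == query)
```

html.unescape also handles hexadecimal references such as `&#x27;` and named entities such as `&eacute;`, so accented characters are decoded the same way.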
Mark Tolonen