2

I'm using the YouTube Data API to get the titles of some music. But when I get a title and print it, the title just looks like Unicode characters. For example:

#music title: Røyksopp
print(title)
#Output: R\u00f6yksopp

Or:

#music title: Nurse's
print(title)
#Output: Nurse&#39;s

Why am I getting this, and how can I fix it?

stvar
droppels

2 Answers

2

This is not an encoding but an escaping:

>>> import html
>>> html.unescape("Nurse&#39;s")
"Nurse's"

The other one is already decoded, nothing to be done:

>>> "R\u00f6yksopp"
'Röyksopp'

If you're still seeing ASCII-only escape sequences instead of accented text, you (or your client library) may have missed a JSON deserialization step somewhere:

>>> import json
>>> json.loads('"\\u00f6"')
'ö'
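
To put the two steps together, here is a minimal sketch. It assumes you are holding the raw response text rather than an already-parsed object; `raw_body` and `extract_title` are made-up names for illustration, not part of any YouTube client library.

```python
import html
import json

# Hypothetical raw response body, before any JSON parsing. Because this is
# a raw string, the \u00f6 sequence and the &#39; reference are literal text.
raw_body = r'{"title": "R\u00f6yksopp - Nurse&#39;s"}'


def extract_title(body: str) -> str:
    """Parse the JSON body, then undo the HTML escaping in the title."""
    data = json.loads(body)              # decodes the \uXXXX escapes
    return html.unescape(data["title"])  # decodes the HTML character references


print(extract_title(raw_body))
```

The order matters only in that `json.loads` has to run on the raw text; `html.unescape` is then applied to the resulting `str`.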
wim
  • @stvar Yes, you're right. It's a little strange though because the question shows Røyksopp text which would be codepoint \u00f8 (stroke) not \u00f6 (diaeresis) – wim Nov 18 '20 at 19:56
  • That's why I asked the OP to post his code and the ID of his culprit video. – stvar Nov 18 '20 at 19:59
1

First, please note that what you've got from the API are not (quote from you) Unicode characters. To be technically precise, those sequences of characters are HTML character references, also known as HTML entities.

The behavior you've encountered is a well-known issue of the API. I know of no solution other than replacing those HTML entities yourself with the actual characters that they stand for.

In Python 3, you can use the function html.unescape from the standard html module:

import html
print(html.unescape(title))

This code will print Nurse's when title is Nurse&#39;s.
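
If you have several titles to clean up, the same call can be applied to each one. The sketch below uses a made-up helper name (`unescape_titles`) and made-up sample strings, only to show that html.unescape handles decimal, hexadecimal and named references alike:

```python
import html


def unescape_titles(titles):
    """Return a new list with HTML character references replaced."""
    return [html.unescape(t) for t in titles]


# Illustrative inputs, not actual API output.
samples = [
    "Nurse&#39;s",      # decimal numeric reference
    "Nurse&#x27;s",     # hexadecimal numeric reference
    "Rock &amp; Roll",  # named reference
    "R&oslash;yksopp",  # named reference for a non-ASCII letter
]

for original, cleaned in zip(samples, unescape_titles(samples)):
    print(f"{original!r} -> {cleaned!r}")
```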


As for your output R\u00f6yksopp, please post the code that queries the API, so that we can see why the \uXXXX escape sequences are not processed properly by your program. You may also post the ID of the video that produced this output, so that I can check it myself.
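
Without your code I can only guess, but one way to end up with visible \uXXXX sequences is to print JSON text instead of the decoded string, for example after re-serializing with json.dumps, which escapes non-ASCII characters by default. A small sketch under that assumption (the variable names are mine):

```python
import json

title = "R\u00f6yksopp"  # an ordinary str; printing it shows the accented letter

escaped = json.dumps(title)                       # ASCII-only JSON text
readable = json.dumps(title, ensure_ascii=False)  # keeps non-ASCII characters

print(title)     # the decoded string
print(escaped)   # JSON text containing a literal \u00f6 escape
print(readable)  # JSON text with the character itself
```

If this is what happens in your program, either print the decoded string directly or pass ensure_ascii=False when serializing.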

stvar