I have a list of songs that I am trying to search for on YouTube. However, when a song's artist or title contains special characters, the code below raises the error shown after it:

Code:

import urllib.request
import re

search_kw = tracks[3]['Artist'] + '+' + tracks[3]['Track Title']
search_kw = search_kw.replace(' ','+')

html = urllib.request.urlopen("https://www.youtube.com/results?search_query=" + search_kw)
video_ids = re.findall(r"watch\?v=(\S{11})", html.read().decode())
print("https://www.youtube.com/watch?v=" + video_ids[0])

UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 43: ordinal not in range(128)

Example of string that causes error:

Tutu Au Mic'  –  dumbéa

How can I convert the special characters into regular characters to prevent the error from occurring?

mkrieger1
Anthony Reid
  • Does this answer your question? [What is the best way to remove accents (normalize) in a Python unicode string?](https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-normalize-in-a-python-unicode-string) – mkrieger1 Jan 06 '22 at 21:05
  • What is the full traceback of the UnicodeEncodeError? – mkrieger1 Jan 06 '22 at 21:07
  • Probably the more appropriate solution is this: https://stackoverflow.com/questions/36395705/unicode-string-in-urllib-request – mkrieger1 Jan 06 '22 at 21:08

3 Answers

1

Use the Unidecode library for this: https://pypi.org/project/Unidecode/. It guarantees an ASCII string in return.
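If installing a third-party package is not an option, a rough standard-library approximation is possible. Note that this is not Unidecode itself but a different technique: it decomposes accented characters with `unicodedata` and drops the combining marks, then discards anything still outside ASCII. The helper names `strip_accents` and `to_ascii` are made up for this sketch, and characters without a decomposition (such as the en dash in the example string) are simply removed rather than transliterated.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def to_ascii(text):
    """Strip accents, then silently drop any remaining non-ASCII characters."""
    return strip_accents(text).encode("ascii", "ignore").decode("ascii")


print(strip_accents("dumbéa"))
print(to_ascii("Tutu Au Mic' – dumbéa"))
```

Unidecode covers far more scripts than this sketch, so it remains the better choice when real transliteration is needed.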

DrummerMann
0

For a web query you probably need to use urlencode:

urllib.parse.urlencode(query, doseq=False, safe='', encoding=None, errors=None, quote_via=quote_plus)

Alternatively, for general character translations, use the string maketrans method together with translate:

Python 3.9.5 (default, Nov 18 2021, 16:00:48) 
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> txt = "Tutu Au Mic' – dumbéa"
>>> mytable = txt.maketrans("é", "e")
>>> print(txt.translate(mytable))
Tutu Au Mic' – dumbea
>>> 
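For the urlencode route mentioned at the start of this answer, a minimal sketch might look like the following. The function name `build_search_url` and the sample artist and title are assumptions for illustration; the `search_query` parameter name is taken from the URL in the question.

```python
from urllib.parse import urlencode


def build_search_url(artist, title):
    """Build a YouTube search URL with the query percent-encoded as UTF-8."""
    # urlencode uses quote_plus by default: spaces become '+',
    # and non-ASCII characters become %XX escapes.
    query = urlencode({"search_query": f"{artist} {title}"})
    return "https://www.youtube.com/results?" + query


url = build_search_url("Tutu Au Mic'", "dumbéa")
print(url)
```

The resulting URL contains only ASCII characters, so it can be passed to `urllib.request.urlopen` without triggering the `UnicodeEncodeError`.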

0

Instead of stripping the special characters, you should encode the non-ASCII characters. YouTube will likely understand what you mean with an ASCII approximation, but not all characters have one. It is also unnecessary: there are well-defined ways to pass non-ASCII characters as part of a URL's query string.

The standard library offers urllib.parse.quote_plus for escaping text to be used in a query string. Alternatively, use the excellent requests library: https://docs.python-requests.org/en/latest/.
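As a sketch of the quote_plus approach applied to the string from the question (the variable names here are illustrative, not from the original code):

```python
from urllib.parse import quote_plus, unquote_plus

# The example string from the question, including the en dash and the accent.
keywords = "Tutu Au Mic' – dumbéa"

# quote_plus turns spaces into '+' and percent-encodes everything else
# that is not URL-safe, using UTF-8 by default.
encoded = quote_plus(keywords)
search_url = "https://www.youtube.com/results?search_query=" + encoded
print(search_url)

# Nothing is lost: decoding gives back the original text.
assert unquote_plus(encoded) == keywords
```

Because the accented characters are escaped rather than replaced, the server receives the exact original text, which is why this is preferable to an ASCII approximation.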

Peter DeGlopper