
I know many people have encountered this error before, but I couldn't find a solution to my problem.

I have a URL that I want to normalize:

import urllib
from urlparse import urlsplit

url = u"http://www.dgzfp.de/Dienste/Fachbeitr%C3%A4ge.aspx?EntryId=267&Page=5"
scheme, host_port, path, query, fragment = urlsplit(url)
path = urllib.unquote(path)
path = urllib.quote(path, safe="%/")

This gives an error message:

/usr/lib64/python2.6/urllib.py:1236: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  res = map(safe_map.__getitem__, s)
Traceback (most recent call last):
  File "url_normalization.py", line 246, in <module>
    logging.info(get_canonical_url(url))
  File "url_normalization.py", line 102, in get_canonical_url
    path = urllib.quote(path,safe="%/")
  File "/usr/lib64/python2.6/urllib.py", line 1236, in quote
    res = map(safe_map.__getitem__, s)
KeyError: u'\xc3'

If I remove the unicode prefix "u" from the URL string, the error goes away. But how can I convert the value from unicode automatically, given that I read it directly from a database?

fanchyna
  • Possible duplicate of [Python unicode equal comparison failed](https://stackoverflow.com/questions/18193305/python-unicode-equal-comparison-failed) – Alastair Irvine Jul 26 '17 at 10:15

1 Answer

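In Python 2, urllib.quote() expects a byte string (str) and does not handle unicode strings that contain non-ASCII characters, such as the u'\xc3' that unquote() produces here. To get around this, call the .encode() method on the URL when you read it (or on the variable you read from the database), i.e. run url = url.encode('utf-8'). With this you get: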

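import urllib
from urlparse import urlsplit

url = u"http://www.dgzfp.de/Dienste/Fachbeitr%C3%A4ge.aspx?EntryId=267&Page=5"
# Encode to a UTF-8 byte string (str) so urllib.quote() receives bytes, not unicode
url = url.encode('utf-8')
scheme, host_port, path, query, fragment = urlsplit(url)
# Decode existing percent-escapes, then re-quote, leaving "%" and "/" untouched
path = urllib.unquote(path)
path = urllib.quote(path, safe="%/")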

and then your output for the path variable will be:

>>> path
'/Dienste/Fachbeitr%C3%A4ge.aspx'
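
If the same code has to handle values that may or may not arrive as unicode (for example, rows read from a database), the encode step can be wrapped in a small helper. This is only a sketch: the name normalize_path is made up for illustration, and the Python 3 branch is an assumption added so the snippet runs on either version (on Python 3, urllib.parse.quote accepts text directly, so no encode step is needed).

```python
try:
    # Python 2: quote/unquote live in urllib, urlsplit in urlparse
    from urllib import quote, unquote
    from urlparse import urlsplit
    PY2 = True
except ImportError:
    # Python 3: everything lives in urllib.parse
    from urllib.parse import quote, unquote, urlsplit
    PY2 = False


def normalize_path(url):
    """Return the re-quoted path of url (hypothetical helper name)."""
    # On Python 2, turn unicode into a UTF-8 byte string before quoting;
    # str values pass through unchanged.
    if PY2 and not isinstance(url, str):
        url = url.encode("utf-8")
    path = urlsplit(url).path
    # Decode existing percent-escapes, then re-quote, keeping "%" and "/"
    return quote(unquote(path), safe="%/")


url = u"http://www.dgzfp.de/Dienste/Fachbeitr%C3%A4ge.aspx?EntryId=267&Page=5"
print(normalize_path(url))
```

On both versions this should print the same path as above, /Dienste/Fachbeitr%C3%A4ge.aspx.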

Does this work?

Jason B
  • This works for me. Thank you. Actually, I printed type(url) before and after encoding and found that it is <type 'unicode'> before encoding and <type 'str'> after encoding. – fanchyna Jan 30 '15 at 14:50