Dealing with encoding - Webscraping using Python

Question

I am trying to extract some href attributes from a web page using Python. The code below works well overall, but the returned href values do not handle accented characters properly. I tried several approaches, but none of them worked.

Here is my code:

links = browser.find_elements_by_xpath(path)
for link in links:
    code = link.get_attribute("href")
    print(code)
    f.write(code + "\n")

For instance, I get this: "http//ww.blabla//Cl%C3%A9ment"
instead of this: "http//ww.blabla//Clément"

That is part of the URL and is not supposed to be removed, so it is better not to mess with it. — AmaanK, Jan 03 '21 at 16:57
Maybe this will help you: https://stackoverflow.com/questions/16566069/url-decode-utf-8-in-python — foragerDev, Jan 03 '21 at 16:59

score 0 · Answer 1 · answered Jan 03 '21 at 17:36

Thanks Mohsan Ali,

I found an answer thanks to your link. Here is how it works:

import urllib.parse

links = browser.find_elements_by_xpath(path)
for link in links:
    code = link.get_attribute("href")
    # Decode percent-escapes such as %C3%A9 back into characters (é)
    code = urllib.parse.unquote(code)
    print(code)
    f.write(code + "\n")

I'm on Python 3, where unquote lives in urllib.parse, so using:

import urllib.parse
urllib.parse.unquote(url)

works fine!
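For anyone who wants to check the decoding step without a browser, here is a minimal sketch that runs on its own. The helper name decode_href and the hrefs list are made up for illustration; the list simply stands in for the values that link.get_attribute("href") would return, using the sample URL from the question.

```python
from urllib.parse import unquote


def decode_href(href):
    """Return href with percent-escapes decoded (unquote defaults to UTF-8)."""
    return unquote(href)


# Hypothetical stand-in for the href values collected from the page.
hrefs = ["http//ww.blabla//Cl%C3%A9ment"]

decoded = [decode_href(h) for h in hrefs]
for url in decoded:
    print(url)  # http//ww.blabla//Clément
```

If the decoded URLs are then written to a file, opening it with an explicit encoding, for example open(path, "w", encoding="utf-8"), is probably the safer choice, since the default encoding of open() depends on the platform.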

Thanks very much for your quick help.
