Selenium - Crawling a Spanish Website - UTF-8

Question

I'm having trouble crawling a website that uses Spanish characters. I wrote the following code to generate the codes the website uses for its leagues:

from collections import defaultdict

import selenium.common.exceptions
import selenium.webdriver

LEAGUES = ['Internacional', 'Inglaterra', 'España', 'Francia', 'Alemania', 'Italia', 'Holanda', 'Portugal', 'Australia',
           'Bélgica', 'Egipto', 'Grecia', 'Omán', 'Irán', 'Japón', 'Kuwait', 'Marruecos', 'Arabia Saudí', 'Escocia', 'Turquía',
           'Irlanda del Norte', 'Dinamarca', 'Rusia', 'Emiratos Árabes', 'Gales', 'Túnez', 'Noruega', 'Suecia', 'Argelia', 'Israel']

def codes_generator():
    """
    generates dictionary containing codes for every division available
    """
    codes = defaultdict(dict)
    driver = selenium.webdriver.Chrome(executable_path='/media/Data.II/Dropbox/Projects/football-bidder/utils/chromedriver')
    driver.get('https://www.miljugadas.com/es-ES/sportsbook')
    driver.find_element_by_class_name('sport_240').click()
    for league in LEAGUES:
        try:
            league = driver.find_element_by_link_text(league)
            league.click()
        except selenium.common.exceptions.NoSuchElementException as e:
            continue
        divisions = league.find_element_by_xpath("parent::*").find_elements_by_tag_name('li')
        for division in divisions:
            division = division.find_element_by_tag_name('a')
            division_code = division.get_attribute('data-id')
            division_name = division.text
            codes[league.text][division_name] = division_code
    return codes


{u'B\xe9lgica': {u'B\xe9lgica - Jupiler League': u'52995'}, u'Espa\xf1a': {u'Espa\xf1a - Liga BBVA': u'23170', u'Espa\xf1a - Copa del Rey': u'67954'}, u'Kuwait': {u'Kuwait \u2013 Liga': u'128783'}, u'Holanda': {u'Holanda - Eredivisie': u'47282'}, u'Irlanda del Norte': {u'Irlanda del Norte - Premier': u'57274'}, u'Grecia': {u'Grecia - Super Liga': u'53509'}}

It returns a dictionary that is a pain to work with. I can't look up leagues such as Spain (España), whose names contain special Spanish characters.

Python 2. How can I store everything with the special characters intact? I want the dictionary keys to be Bélgica and España, not B\xe9lgica and Espa\xf1a. – FranGoitia Dec 24 '15 at 16:53

score 0 · Answer 1 · edited May 23 '17 at 10:28

It seems that your problem is an encoding issue. I would suggest that you:

declare the source encoding explicitly with a coding comment (for example `# -*- coding: utf-8 -*-` on the first line of the file)

encode the Unicode strings (u'string') to byte strings where you need them, as in this related question:

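# -*- coding: utf-8 -*-
es_string = u"mañana"        # Unicode literal (Python 2)
es_string.encode("latin-1")  # 'ma\xf1ana'
es_string.encode("utf-8")    # 'ma\xc3\xb1ana'
es_string.encode("ascii")    # raises UnicodeEncodeError: ñ is not ASCII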
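
As a rough sketch of the first suggestion (not the asker's actual script): once the file declares its encoding and uses Unicode literals, the accented keys can be typed directly in the source. The `codes` dictionary below is a hand-copied subset of the output in the question, and the `__future__` import is an assumption made here so the snippet behaves the same on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

# Hand-copied subset of the dictionary shown in the question.
codes = {
    'B\xe9lgica': {'B\xe9lgica - Jupiler League': '52995'},
    'Espa\xf1a': {'Espa\xf1a - Liga BBVA': '23170',
                  'Espa\xf1a - Copa del Rey': '67954'},
}

# 'España' and the escaped form 'Espa\xf1a' are the same Unicode string,
# so the key can be written with the real character.
spain = codes['España']
liga_code = spain['España - Liga BBVA']
print(liga_code)
```

The lookup works because the escape sequence and the typed character denote the same code point; only the way the string is displayed differs.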

answered Dec 24 '15 at 18:12 by mabe02 · edited May 23 '17 at 10:28 by Community

It's good practice to only convert Unicodes to str() on output. I don't understand why you're trying to encode the Unicode with so many different codecs. `es_string.encode("ascii")` will fail for sure! – Alastair McCormack Dec 24 '15 at 22:30
I was just listing some examples so that he could choose the encoding he prefers. It wasn't intended to be confusing! – mabe02 Dec 24 '15 at 23:02

score 0 · Answer 2 · answered Dec 24 '15 at 22:25

u'B\xe9lgica' is just the safe representation of a Unicode string. \xe9 == Unicode U+00E9 == é (http://www.fileformat.info/info/unicode/char/e9/index.htm).

If you were to print the Unicode objects to a compatible console then you'd see the correct characters.
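
To illustrate the difference between the escaped representation and the string itself, here is a minimal sketch. The names `as_shown_in_dict`, `as_text` and `same_key` are made up for this example, and the `__future__` import is an assumption so that the snippet runs on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

league = 'B\xe9lgica'  # the key as it appears in the question's output

# repr() is what you see when a whole dict is dumped. Python 2 escapes
# non-ASCII characters there; Python 3 shows them unescaped.
as_shown_in_dict = repr({league: '52995'})

# The string itself holds the real character either way.
as_text = '{0}: {1}'.format(league, '52995')
same_key = (league == 'Bélgica')
```

Printing `as_text` on a console that supports UTF-8 should show Bélgica with its accent, while `as_shown_in_dict` is the kind of dump the question started from.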

You can also save the Unicode objects to a file through an encoding text wrapper (io.TextIOWrapper, which io.open returns in text mode) from the io module. This lets you save the text as UTF-8.

Here's an example of doing both:

import io

with io.open("myoutfile.txt", "w", encoding="UTF-8") as my_file:
    for league, divisions in codes_generator().items():
        print league                   # shows the accents on a compatible console
        my_file.write(league + u"\n")  # one league name per line
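
If the division names and codes should be stored as well, the same idea extends to the nested dictionary. This is only a sketch under some assumptions: the `codes` sample is copied from the question instead of calling `codes_generator()`, the tab-separated layout and the temporary file path are choices made for this example, and the `__future__` import is there so it runs on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import io
import os
import tempfile

# Sample copied from the question's output; stands in for codes_generator().
codes = {'Espa\xf1a': {'Espa\xf1a - Liga BBVA': '23170',
                       'Espa\xf1a - Copa del Rey': '67954'}}

# Hypothetical output location, used only for this example.
path = os.path.join(tempfile.mkdtemp(), 'codes.tsv')

# Write one "league<TAB>division<TAB>code" row per division, as UTF-8.
with io.open(path, 'w', encoding='utf-8') as out_file:
    for league, divisions in sorted(codes.items()):
        for name, code in sorted(divisions.items()):
            out_file.write('{0}\t{1}\t{2}\n'.format(league, name, code))

# Read the rows back; io.open decodes the bytes into Unicode strings again.
with io.open(path, 'r', encoding='utf-8') as in_file:
    rows = [line.rstrip('\n').split('\t') for line in in_file]
```

Keeping everything as Unicode in memory and letting `io.open` handle the encoding at the file boundary avoids scattering `.encode()` calls through the code.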
