Parser returns wrong url

Question

I'm parsing dialect words from http://www.dialettando.com/dizionario/hitlist_regioni_new.lasso?regione=Sardegna.

from urllib import request  

from bs4 import BeautifulSoup
from nltk import corpus, word_tokenize, FreqDist, ConditionalFreqDist

url = 'http://www.dialettando.com/dizionario/hitlist_regioni_new.lasso?regione=Sardegna'
dialettando_tokens = []

while url:
    html = request.urlopen(url).read().decode('utf8')
    page = BeautifulSoup(html, 'html.parser')
    a_list = page.find_all('a')
    for a in a_list:
        try:
            a_str = str(a.contents[0])
            if a_str[:3] == '<b>' and a.contents[0].string:
                dialettando_tokens.append(a.contents[0].string.strip())
        except:
            pass

        if a.string == 'Simonelli Editore Srl':
            break
        elif a.string == 'PROSSIMI':
            link = a['href']
            url = 'http://www.dialettando.com/dizionario/' + link
            break
        else:
            url = ''

At the end of each iteration I need to extract the URL of the next page. HTML:

<a href="hitlist_regioni_new.lasso?saltarec=20&ordina=parola_dialetto&regione=Sardegna" class="titolinoverdone">PROSSIMI</a>

And I need to get this link:

'hitlist_regioni_new.lasso?saltarec=20&ordina=parola_dialetto&regione=Sardegna'

BUT the parser returns:

'hitlist_regioni_new.lasso?saltarec=20&ordina=parola_dialettoRione=Sardegna'

This link doesn't work correctly and I can't understand what's wrong.

It appears that &reg is an HTML entity, same as &reg;, meaning "Registered trademark" (®). It appears to be replacing it with capital "R" – maxpolk, Jan 08 '16 at 23:53

score 1 · Accepted Answer · edited May 23 '17 at 10:28

An href needs to have the ampersand character escaped as &amp;, see this question. It is possible that the site you visited does not escape the ampersands inside the href correctly and relies on never accidentally spelling out an HTML entity, except that in your case it did: &regione begins with &reg. So it seems you have to deal with buggy HTML, plus a parser that did not notice the missing semicolon and did the HTML entity conversion anyway.
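
One possible workaround, as a minimal sketch rather than a definitive fix: escape every bare ampersand in the raw markup before it reaches the parser. The helper name escape_bare_ampersands and the regular expression are illustrative assumptions, not part of any library. The standard library's html.unescape stands in for the parser's entity handling here, since it also decodes &reg when the semicolon is missing. The sketch further assumes the page never uses semicolon-less entities on purpose.

```python
import html
import re

# Hypothetical helper pattern: an "&" that does NOT start a complete,
# semicolon-terminated reference such as &amp; or &#38; or &#x26;
BARE_AMP = re.compile(r'&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)')


def escape_bare_ampersands(markup):
    """Turn unescaped '&' into '&amp;' so entity decoding leaves it alone."""
    return BARE_AMP.sub('&amp;', markup)


# The href exactly as it appears in the page source
raw_href = ('hitlist_regioni_new.lasso?saltarec=20'
            '&ordina=parola_dialetto&regione=Sardegna')

# Decoding the raw value reproduces the problem: "&reg" becomes "®"
broken = html.unescape(raw_href)

# Escaping first means "&amp;" decodes back to a plain "&"
fixed = html.unescape(escape_bare_ampersands(raw_href))

print(broken)
print(fixed)
```

In the loop from the question, the same idea would mean calling escape_bare_ampersands(html) on the downloaded text before handing it to BeautifulSoup. Whether a different parser backend would handle the unescaped href differently is not covered here.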
