I'm parsing dialect words from http://www.dialettando.com/dizionario/hitlist_regioni_new.lasso?regione=Sardegna with the following code:
    from urllib import request
    from bs4 import BeautifulSoup
    from nltk import corpus, word_tokenize, FreqDist, ConditionalFreqDist

    url = 'http://www.dialettando.com/dizionario/hitlist_regioni_new.lasso?regione=Sardegna'
    dialettando_tokens = []
    while url:
        html = request.urlopen(url).read().decode('utf8')
        page = BeautifulSoup(html, 'html.parser')
        a_list = page.find_all('a')
        for a in a_list:
            # Dialect words are links whose first child is a <b> tag
            try:
                a_str = str(a.contents[0])
                if a_str[:3] == '<b>' and a.contents[0].string:
                    dialettando_tokens.append(a.contents[0].string.strip())
            except Exception:
                pass
            if a.string == 'Simonelli Editore Srl':
                # Footer link reached without a "next" link: last page
                break
            elif a.string == 'PROSSIMI':
                # "PROSSIMI" (next) link: follow it on the next iteration
                link = a['href']
                url = 'http://www.dialettando.com/dizionario/' + link
                break
            else:
                url = ''
At the end of each iteration I need to extract the URL of the next page from this HTML:
    <a href="hitlist_regioni_new.lasso?saltarec=20&ordina=parola_dialetto&regione=Sardegna" class="titolinoverdone">PROSSIMI</a>
I need to get this link:
    'hitlist_regioni_new.lasso?saltarec=20&ordina=parola_dialetto&regione=Sardegna'
But the parser returns:
    'hitlist_regioni_new.lasso?saltarec=20&ordina=parola_dialetto®ione=Sardegna'
This link doesn't work, and I can't understand what's wrong.
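
To narrow the problem down, here is a minimal sketch that seems to reproduce it without the network or BeautifulSoup. It assumes that the 'html.parser' backend decodes attribute values the same way the standard library's html.unescape does, which may differ between Python versions; raw_href is copied from the page source, while BASE and repair_href are hypothetical names I introduced for illustration. The string replacement is only a possible workaround, not a confirmed fix.

```python
from html import unescape
from urllib.parse import urljoin

# Assumption: the directory part of the URL used in my loop above.
BASE = 'http://www.dialettando.com/dizionario/'

# The href exactly as it appears in the page source.
raw_href = 'hitlist_regioni_new.lasso?saltarec=20&ordina=parola_dialetto&regione=Sardegna'

# "&reg" is a legacy named entity that gets decoded even without a
# trailing ';', so "&regione" turns into the (R) sign followed by "ione".
decoded = unescape(raw_href)


def repair_href(href):
    """Hypothetical helper: turn a wrongly decoded U+00AE back into '&reg'."""
    return href.replace('\u00ae', '&reg')


# urljoin resolves the relative link against the base URL.
next_url = urljoin(BASE, repair_href(decoded))

print(decoded)
print(next_url)
```

If this is really what happens, the query string would be intact in the page and only mangled when the attribute is decoded, so restoring '&reg' before building the URL might be enough for this site.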