
I am trying to crawl wordreference, but I am not succeeding.

The first problem I have encountered is that a big part of the page is loaded via JavaScript, but that shouldn't be much of a problem, because I can see what I need in the source code.

So, for example, I want to extract the first two meanings for a given word; from this URL: http://www.wordreference.com/es/translation.asp?tranword=crane I need to extract grulla and grúa.

This is my code:

import lxml.html as lh
import urllib2

url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
doc = lh.parse(urllib2.urlopen(url))
trans = doc.xpath('//td[@class="ToWrd"]/text()')

for i in trans:
    print i

The result is that I get an empty list.

I have tried to crawl it with Scrapy too, with no success. I am not sure what is going on; the only way I have been able to crawl it is with curl, but that is sloppy. I want to do it in an elegant way, with Python.

Thank you very much

aDoN

1 Answer


It looks like you need a User-Agent header to be sent; see Changing user agent on urllib2.urlopen.
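As a minimal sketch of that first option (the header value below is illustrative, and the example uses `urllib.request`, the Python 3 home of `urllib2`'s `Request`/`urlopen`):

```python
import urllib.request  # in Python 2 this module is urllib2

url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
# Build a Request carrying a browser-like User-Agent header;
# the exact header value is illustrative, not something the site documents
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
# Passing the Request object (rather than the bare URL) to urlopen
# sends the custom header along with the HTTP request:
# html = urllib.request.urlopen(req).read()
print(req.get_header('User-agent'))  # Mozilla/5.0
```

Note that `urllib` normalizes stored header names (here `User-agent`), but the header is sent correctly either way.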

Also, just switching to requests would do the trick (it automatically sends a python-requests/version User-Agent header by default):

import lxml.html as lh
import requests

url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'

response = requests.get(url)
doc = lh.fromstring(response.content)

trans = doc.xpath('//td[@class="ToWrd"]/text()')
for i in trans:
    print(i)

Prints:

grulla 
grúa 
plataforma 
...
grulla blanca 
grulla trompetera 
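If the default python-requests User-Agent ever gets blocked as well, requests also lets you supply your own. A small sketch (the header value is illustrative) that inspects the outgoing headers without making a network call:

```python
import requests

url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
# Prepare the request to see exactly which headers would be sent;
# the User-Agent value here is just an example string
req = requests.Request('GET', url, headers={'User-Agent': 'Mozilla/5.0'})
prepared = req.prepare()
print(prepared.headers['User-Agent'])  # Mozilla/5.0
# To actually fetch the page with that header:
# response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
```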
alecxe
  • Thank you, but what is the reason it doesn't work with `urllib2` without a `User-Agent`? I have crawled other websites with it without problems; why not this one? – aDoN Jan 19 '16 at 09:15