
I am trying to crawl wordreference, but I am not succeeding.

The first problem I have encountered is that a big part of the page is loaded via JavaScript, but that shouldn't be much of a problem, because I can see what I need in the source code.

So, for example, I want to extract the first two meanings for a given word; from this URL: http://www.wordreference.com/es/translation.asp?tranword=crane I need to extract grulla and grúa.

This is my code:

import lxml.html as lh
import urllib2

url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
doc = lh.parse(urllib2.urlopen(url))
trans = doc.xpath('//td[@class="ToWrd"]/text()')

for i in trans:
    print i

The result is that I get an empty list.

I have tried to crawl it with Scrapy too, with no success. I am not sure what is going on; the only way I have been able to crawl it is with curl, but that is sloppy. I want to do it in an elegant way, with Python.

Thank you very much

aDoN

1 Answer


It looks like you need a User-Agent header to be sent; see Changing user agent on urllib2.urlopen.
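As a minimal sketch of that first option (the header value below is illustrative, and the example uses `urllib.request`, the Python 3 home of `urllib2`'s `Request`/`urlopen`):

```python
import urllib.request  # in Python 2 this module is urllib2

url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
# Build a Request carrying a browser-like User-Agent header;
# the exact header value is illustrative, not something the site documents
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
# Passing the Request object (rather than the bare URL) to urlopen
# sends the custom header along with the HTTP request:
# html = urllib.request.urlopen(req).read()
print(req.get_header('User-agent'))  # Mozilla/5.0
```

Note that `urllib` normalizes stored header names (here `User-agent`), but the header is sent correctly either way.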

Also, just switching to requests would do the trick (it automatically sends a python-requests/version User-Agent header by default):

import lxml.html as lh
import requests

url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'

response = requests.get(url)
doc = lh.fromstring(response.content)

trans = doc.xpath('//td[@class="ToWrd"]/text()')
for i in trans:
    print(i)

Prints:

grulla 
grúa 
plataforma 
...
grulla blanca 
grulla trompetera 
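If the default python-requests User-Agent ever gets blocked as well, requests also lets you supply your own. A small sketch (the header value is illustrative) that inspects the outgoing headers without making a network call:

```python
import requests

url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
# Prepare the request to see exactly which headers would be sent;
# the User-Agent value here is just an example string
req = requests.Request('GET', url, headers={'User-Agent': 'Mozilla/5.0'})
prepared = req.prepare()
print(prepared.headers['User-Agent'])  # Mozilla/5.0
# To actually fetch the page with that header:
# response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
```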
alecxe
  • Thank you, but what is the reason it doesn't work with `urllib2` without a `User-Agent`? I have crawled other websites with it without problems; why not this one? – aDoN Jan 19 '16 at 09:15