I'm currently working on a web scraper without any frameworks and have run into an issue. I test an XPath expression to, say, get the table data on a Wikipedia page, but when I run it in my scraper and print the result to the console, it returns only an empty list. Can anyone please advise, and perhaps suggest some useful books on XPath for web scraping? (I have Safari Books, if that helps.)

import requests
from lxml import html

page = requests.get('https://en.wikipedia.org/wiki/L.A.P.D._(band)')
tree = html.fromstring(page.content)

# OK
bandName = tree.xpath('//*[@id="firstHeading"]/text()')
overview = tree.xpath('//*[@id="mw-content-text"]/p[1]//text()')
print(bandName)
print(overview)


# Trouble code (both expressions return an empty list)
yearsActive = tree.xpath('//*[@id="mw-content-text"]/table[1]/tbody/tr[6]//text()')
print(yearsActive)
members = tree.xpath('//*[@id="mw-content-text"]/table[1]/tbody/tr[11]/td[1]/ul/li/a//text()')
print(members)

UPDATE: While conducting more testing, I discovered that print(len(members)) returns zero, which seems to indicate that something is wrong with my XPath expression. Yet when I test the members expression in the Chrome console, it returns a list of band members.

user502301
  • Is there a reason why you're processing the HTML instead of the actual data for the page? – Ignacio Vazquez-Abrams Apr 24 '16 at 07:05
  • I'm not sure how to 'process the actual data'. I am very new to working with XPath and scraping in general. Could you please explain how I can process the actual data? – user502301 Apr 24 '16 at 07:07
  • Only **yearsActive** and **members** are empty. Do you mean all variables are empty, @user502301? – wrufesh Apr 24 '16 at 07:09
  • @user502301 No, only yearsActive and members are coming back empty. The other two variables are working. I have updated my code submission to better separate the code that's not working from the code that is working. – user502301 Apr 24 '16 at 07:15
  • @Ignacio-Vazquez-Abrams I'm not sure how to 'process the actual data'. I am very new to working with XPath and scraping in general. Could you please explain how I can process the actual data? – user502301 Apr 24 '16 at 07:19

1 Answer

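Your XPath fails because the tables in the raw HTML don't have tbody elements. The tbody elements you see in this case are likely generated by the browser (see the related question below), so an expression copied from the browser tools has to drop the tbody step: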

>>> yearsActive = tree.xpath('//*[@id="mw-content-text"]/table[1]/tr[6]/td/text()')
>>> print yearsActive
[u'1989\u20131992']
>>> members = tree.xpath('//*[@id="mw-content-text"]/table[1]/tr[10]/td[1]//text()[normalize-space()]')
>>> print members
['James Shaffer', 'Reginald Arvizu', 'David Silveria', '\nRichard Morrill', '\nPete Capra', '\nCorey (surname unknown)', '\nDerek Campbell', '\nTroy Sandoval', '\nJason Torres', '\nKevin Guariglia']
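The same effect can be reproduced offline. The following is a minimal sketch, assuming a small hand-written HTML fragment (the markup and cell values are made up for illustration and are not the real Wikipedia page). It shows that lxml parses the table without adding tbody, so the expression copied from the browser matches nothing, while a path using `//tr` matches whether or not tbody is present:

```python
from lxml import html

# Hypothetical stand-in for the raw HTML a server might send: no <tbody>.
RAW_HTML = """
<html><body>
<div id="mw-content-text">
  <table>
    <tr><th>Label 1</th><td>first</td></tr>
    <tr><th>Label 2</th><td>second</td></tr>
  </table>
</div>
</body></html>
"""

tree = html.fromstring(RAW_HTML)

# Path as copied from browser tools: includes a tbody step.
with_tbody = tree.xpath('//*[@id="mw-content-text"]/table[1]/tbody/tr[2]/td/text()')

# Path matching the raw markup: tr is a direct child of table.
without_tbody = tree.xpath('//*[@id="mw-content-text"]/table[1]/tr[2]/td/text()')

# Path that does not care whether tbody sits between table and tr.
either = tree.xpath('//*[@id="mw-content-text"]/table[1]//tr[2]/td/text()')

print(with_tbody)     # []
print(without_tbody)  # ['second']
print(either)         # ['second']
```

The `//tr` form is convenient when the same expression has to work both in the browser and in the script, at the cost of also matching rows of any nested tables.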

In the future, it is often useful to inspect the HTML that you actually receive from requests.get() whenever an XPath expression unexpectedly fails in code even though the same expression worked fine in the browser tools.
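One possible way to do that inspection is to count which tags occur in the received markup before writing any XPath against it. The sketch below uses only the standard library; `TagCounter` and `count_tags` are hypothetical helper names introduced here, and the sample string stands in for the text of a real response:

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts every opening tag seen in the markup fed to the parser."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        self.counts[tag] += 1


def count_tags(markup):
    """Return a Counter mapping tag name to number of occurrences."""
    parser = TagCounter()
    parser.feed(markup)
    parser.close()
    return parser.counts


# Made-up sample; in the question's script this would be page.text.
sample = "<table><tr><td>1989</td></tr><tr><td>1992</td></tr></table>"
counts = count_tags(sample)
print(counts["table"], counts["tr"], counts["tbody"])  # 1 2 0
```

With the question's code, calling `count_tags(page.text)` and checking `counts["tbody"]` would show whether the server-sent HTML contains any tbody elements at all. Saving `page.text` to a file and opening it in a text editor serves the same purpose.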

Related: Why does my XPath query (scraping HTML tables) only work in Firebug, but not the application I'm developing?

har07
  • Hey @har07, Do you suggest a particular way to do that? Thank you for your help! – user502301 Apr 24 '16 at 08:05
  • I don't have a particular approach to web scraping that I can suggest. I'd just suggest learning XPath; I have found it powerful enough to locate any part of an HTML document. – har07 Apr 24 '16 at 08:13