I need to scrape only the textual content under the References h3 heading at this URL. I'm trying with the code below, but I can't get the text in the same order as it is shown on the HTML page.

    i=43
    while tree.xpath('/html/body/form/table[3]/tr/td/table[5]/tr/td/table[1]/tr/td[2]//p['+str(i)+']/a/text()')!=[] :
        reference=tree.xpath('/html/body/form/table[3]/tr/td/table[5]/tr/td/table[1]/tr/td[2]//p['+str(i)+']/text()')
        link_ref=tree.xpath('/html/body/form/table[3]/tr/td/table[5]/tr/td/table[1]/tr/td[2]//p['+str(i)+']//a/text()')
        testo_reference=testo_reference + link_ref[0]+reference
        i= i+1

I'd like to return an array containing every row under the References heading, with the HTML tags stripped so that only the textual content remains.

Poggio

1 Answer

As suggested in comments, BeautifulSoup makes it insanely easy:

In [2]: from bs4 import BeautifulSoup

In [3]: from urllib.request import urlopen

In [4]: url = "http://www.dlib.org/dlib/november14/brook/11brook.html"

In [5]: soup = BeautifulSoup(urlopen(url), "html.parser")

In [6]: for h3 in soup.find_all("h3"):
   ...:     print(h3.text)
   ...:     
D-Lib Magazine
The Social, Political and Legal Aspects of Text and Data Mining (TDM)
Abstract
1. Introduction
2. Copyright, database right, licences and TDM
3. Recent changes to UK law
4. What can politicians and policy makers do? 
5. Publishers are not embracing opportunities of TDM
6. How can publishers help TDM researchers?
7. Awareness among academics and a technological gap 
8. Conclusion
Notes
References
About the Authors
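
The session above only lists the h3 headings. To get the rows under References in page order, one possible approach is to locate that heading and walk its following siblings until the next h3. The sketch below is not tested against the live page: it runs on a small inline sample (`SAMPLE_HTML`, a hypothetical stand-in) and assumes the reference paragraphs are `<p>` siblings of the heading inside the same parent element. The helper name `references_text` is also made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page, so the sketch runs offline.
# Assumption: each reference is a <p> that is a sibling of the <h3>.
SAMPLE_HTML = """
<html><body>
<h3>Notes</h3>
<p>1 A note.</p>
<h3>References</h3>
<p>[1] First <a href="http://example.org/a">linked title</a>, 2014.</p>
<p>[2] Second reference, no link.</p>
<h3>About the Authors</h3>
<p>Author bio.</p>
</body></html>
"""


def references_text(html):
    """Return the text of each <p> between the References h3 and the next h3."""
    soup = BeautifulSoup(html, "html.parser")
    # Find the h3 whose visible text is exactly "References".
    heading = soup.find(
        lambda tag: tag.name == "h3" and tag.get_text(strip=True) == "References"
    )
    if heading is None:
        return []
    rows = []
    for sibling in heading.find_next_siblings():
        name = getattr(sibling, "name", None)
        if name == "h3":
            break  # reached the next section, stop collecting
        if name == "p":
            # get_text() merges the paragraph's own text with the text of
            # nested tags such as <a>, in document order; split/join then
            # collapses runs of whitespace.
            text = " ".join(sibling.get_text().split())
            if text:
                rows.append(text)
    return rows


refs = references_text(SAMPLE_HTML)
print(refs)
```

Because `get_text()` returns the link text and the surrounding text together in document order, there is no need to query `text()` and `a/text()` separately and stitch them back together, which is where the ordering was lost in the question's loop. If the real page nests the paragraphs differently, the sibling walk would need adjusting.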
alecxe