
I scraped some data from a website, but the result does not contain the section I need. The section is in the lower half of the page, and I want to scrape the name, date, protest location, age, current whereabouts, info, and the news link.

I started with the "name" first but it did not contain the h2 tags. Upon closer inspection using soup.prettify, I found that the page ends some lines above the section I need. I read that scrappers have failed due to jquery or javascript but I do not see such issue here.

Thanks in advance for your help.

import requests
import bs4

root_url = 'http://www.savetibet.org'
index_url = root_url + '/resources/fact-sheets/self-immolations-by-tibetans/'

def get_names_age():
    response = requests.get(index_url)
    soup = bs4.BeautifulSoup(response.text)  # no parser specified, so bs4 picks its default
    print(soup.prettify())  # the prettified output ends before the section I need

    '''
    name_list = soup.find('div', {'class': 'entry'})
    for name in name_list:
        try:
            print(name.h2.text)
        except AttributeError:
            continue
    '''
get_names_age()
  • possible duplicate of [Missing parts on Beautiful Soup results](http://stackoverflow.com/questions/18614305/missing-parts-on-beautiful-soup-results) – Martijn Pieters Mar 05 '15 at 15:09
  • The HTML is broken; use a different parser to get different interpretations on how to repair it, see the duplicate. LXML does find the sections. – Martijn Pieters Mar 05 '15 at 15:10
  • Also note: you should use `response.content` and leave decoding to BeautifulSoup. *Here* there are no problems with the codec that `requests` picked, but if the server set no content type the default is Latin-1 and that is often wrong. BeautifulSoup does a better job of divining the content type. See [my answer here](https://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautifulsoup/22583436#22583436) for a better approach still. – Martijn Pieters Mar 05 '15 at 15:15
  • Thanks Martijn! I will look into the decoding issue right away. I tried LXML already with no success. – everestbaker Mar 05 '15 at 15:20
  • I had no problems retrieving the sections using LXML 3.4.2 (using libxml2 2.9.0) on Mac. html5lib also finds the entries (141). – Martijn Pieters Mar 05 '15 at 15:22
  • I have seen reports that the [Windows builds of lxml from Gohlke](http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml) use libxml2 2.9.2 and are broken when it comes to HTML parsing. – Martijn Pieters Mar 05 '15 at 15:24
  • Solved it using the html5lib parser: `soup = bs4.BeautifulSoup(response.text, 'html5lib')`. Make sure to install the html5lib library first. Thanks for the quick help Martijn. – everestbaker Mar 05 '15 at 16:04
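
Based on the comments, here is a minimal sketch of the working approach, assuming the page structure matches the attempt above: the html5lib parser repairs the broken HTML so the lower section survives, and passing `response.content` lets BeautifulSoup work out the encoding itself. The `div.entry` and `h2` selectors are carried over from the question and may need adjusting for the other fields (date, location, etc.).

import requests
import bs4

root_url = 'http://www.savetibet.org'
index_url = root_url + '/resources/fact-sheets/self-immolations-by-tibetans/'

def get_names_age():
    response = requests.get(index_url)
    # Hand BeautifulSoup the raw bytes and the html5lib parser
    # (requires: pip install html5lib).
    soup = bs4.BeautifulSoup(response.content, 'html5lib')

    entry = soup.find('div', {'class': 'entry'})
    if entry is None:
        return
    # Assumed structure: each name sits in an h2 inside the entry div.
    for heading in entry.find_all('h2'):
        print(heading.get_text(strip=True))

get_names_age()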
