I am trying to parse some tables from a wiki page, e.g. http://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2014. There are four tables with the same class name, "wikitable". When I write:

movieList= soup.find('table',{'class':'wikitable'}) 
rows = movieList.findAll('tr')

It works fine, but when I write:

movieList= soup.findAll('table',{'class':'wikitable'})
rows = movieList.findAll('tr')

It throws an error:

Traceback (most recent call last):
  File "C:\Python27\movieList.py", line 24, in <module>
    rows = movieList.findAll('tr')
AttributeError: 'ResultSet' object has no attribute 'findAll'

When I print movieList, it prints all four tables.

Also, how can I parse the content effectively, given that the number of columns in a row varies? I want to store this information in separate variables.

alecxe
Ankit
  • Have you tried starting with `for my_table in movieList:`... to go over each table in turn? – Jon Clements Feb 03 '15 at 05:08
  • Put `import pdb;pdb.set_trace()` before `rows = ...` and then, in pdb, run `dir(movieList)` to see what can be done with `movieList`. I would presume that what you are looking for is `find_all`? What version of BS is being used? Your syntax suggests BS3, but if it doesn't have `findAll` then I am led to believe that you are using BS4? – jmunsch Feb 03 '15 at 05:09
  • What is pdb? And yes, I am using bs4 with Python 2.7.9. – Ankit Feb 03 '15 at 05:14

1 Answer

findAll() returns a ResultSet object, which is basically a list of elements. A ResultSet has no findAll() of its own, so to find elements inside each element of the ResultSet, use a loop:

movie_list = soup.findAll('table', {'class': 'wikitable'})
for movie in movie_list:
    rows = movie.findAll('tr')
    ...
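
To make the loop concrete, here is a minimal sketch that runs against a small inline HTML string instead of the live page. The two sample tables and the film names are made up for illustration, and `html.parser` is chosen only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia page: two small "wikitable" tables.
html = """
<table class="wikitable">
  <tr><th>Title</th></tr>
  <tr><td>Film A</td></tr>
</table>
<table class="wikitable">
  <tr><th>Title</th></tr>
  <tr><td>Film B</td></tr>
  <tr><td>Film C</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all("table", {"class": "wikitable"})

# A ResultSet behaves like a list, so each table is queried on its own.
rows_per_table = [table.find_all("tr") for table in tables]

for index, rows in enumerate(rows_per_table):
    # Skip the header row and keep the text of the first data cell.
    titles = [row.find("td").get_text(strip=True) for row in rows[1:]]
    print(index, titles)
```

Keeping one list of rows per table preserves the grouping, which matters if each table covers a different release period.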

You could also have used a CSS selector, but in this case it would not be easy to tell which rows belong to which table:

rows = soup.select('table.wikitable tr')
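
If you prefer selectors but still need the grouping, one option is to select the tables first and then select the rows inside each one. This is only a sketch on made-up sample HTML, not the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with two tables of the same class.
html = """
<table class="wikitable"><tr><th>Title</th></tr><tr><td>Film A</td></tr></table>
<table class="wikitable"><tr><th>Title</th></tr><tr><td>Film B</td></tr><tr><td>Film C</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# One flat list: rows from both tables are mixed together.
flat_rows = soup.select("table.wikitable tr")

# Nested selection: one list of rows per table, so the grouping survives.
per_table = [table.select("tr") for table in soup.select("table.wikitable")]

print(len(flat_rows), [len(rows) for rows in per_table])
```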

As a bonus, here is how you can collect all of the "Releases" into a dictionary where the keys are the periods and the values are lists of movies:

from pprint import pprint
import urllib2  # Python 2; on Python 3 use urllib.request.urlopen instead
from bs4 import BeautifulSoup

url = 'http://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2014'
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(urllib2.urlopen(url), 'html.parser')

headers = ['Opening', 'Title', 'Genre', 'Director', 'Cast']
results = {}
# Each release period is an <h3> heading followed by its table
for block in soup.select('div#mw-content-text > h3'):
    title = block.find('span', class_='mw-headline').text
    rows = block.find_next_sibling('table', class_='wikitable').find_all('tr')

    # Skip the header row; pair each cell with a column name by position
    results[title] = [{header: td.text for header, td in zip(headers, row.find_all('td'))}
                      for row in rows[1:]]

pprint(results)

This should get you much closer to solving the problem.

alecxe
  • @Ankit and, by "messed up" you mean pretty-printed and structured in a dictionary? – alecxe Feb 03 '15 at 06:20
  • I mean that 'Title' ends up holding the director's name, 'Director' holds the cast, and so on, because the number of columns in each row is not fixed. – Ankit Feb 03 '15 at 08:08
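
The shift described in the last comment is what happens when a short row is paired with the headers from the left. A possible workaround, sketched below, is to align the cells from the right and carry the missing leading values over from the previous row. This assumes that short rows lack only their leading cells (for example, a date cell that spans several rows), which may not hold for every row on the page. `rows_to_records` and the sample data are hypothetical, and the cell texts are assumed to have been extracted already.

```python
HEADERS = ["Opening", "Title", "Genre", "Director", "Cast"]


def rows_to_records(rows, headers):
    """Turn lists of cell texts into dicts, aligning short rows from the right."""
    records = []
    previous = {}
    for cells in rows:
        missing = len(headers) - len(cells)
        if missing < 0:
            raise ValueError("row has more cells than headers: %r" % (cells,))
        # Align from the right: the trailing headers match the cells present.
        record = dict(zip(headers[missing:], cells))
        # Assumption: absent leading columns repeat the previous row's value.
        for header in headers[:missing]:
            record[header] = previous.get(header)
        records.append(record)
        previous = record
    return records


# Made-up sample rows; the second one has no "Opening" cell of its own.
sample_rows = [
    ["JAN 3", "Film A", "Drama", "Director A", "Cast A"],
    ["Film B", "Comedy", "Director B", "Cast B"],
]
records = rows_to_records(sample_rows, HEADERS)
for record in records:
    print(record)
```

With bs4, the input lists could be built with something like `[td.get_text(strip=True) for td in row.find_all('td')]` for each row.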