I'm using soup.findAll('table') to find the table in an HTML file, but it returns nothing. The table definitely exists in the file, and I can locate it with a regex:

import sys
import urllib2
from bs4 import BeautifulSoup
import re

webpage = open(r'd:\samplefile.html', 'r').read()
soup = BeautifulSoup(webpage)
print re.findall("TABLE", webpage)  # works, prints ['TABLE', 'TABLE']
print soup.findAll("TABLE")         # prints an empty list: []

I know the soup is being generated correctly, because when I run:

print [tag.name for tag in soup.findAll(align=None)]

it correctly prints the tags it finds. I have also tried different spellings of "TABLE", such as "table" and "Table". If I open the file in a text editor, the tag is written as "TABLE".

Why doesn't BeautifulSoup find the table?

dreftymac
I want badges
  • Could you post a sample of this HTML file? – mr2ert Oct 31 '13 at 14:59
  • I have the same problem, trying to scrape ESPN.com: `url = 'http://scores.espn.go.com/nfl/boxscore?gameId=331010003'; boxurl = urllib2.urlopen(url).read(); soup = BeautifulSoup(boxurl); soupTables = soup.findAll('table'); reTables = re.findall('table', boxurl); print len(soupTables), len(reTables)`. `soup.findAll` returns only 1 table, while `re.findall` finds 46 tables. – user2333196 Oct 31 '13 at 16:21
  • @user2333196 After adding `http://` to the url (so I could actually download it), `len(soup.findAll('table'))` returned 23 for me. – mr2ert Oct 31 '13 at 17:13
  • @mr2ert yes, you can find it here: http://jsbin.com/EjaqegU/3/watch?html,output – I want badges Oct 31 '13 at 19:03
  • I copied your HTML into a local file on my computer and ran your code in IPython; it works fine and I get 1 table. Have you tried running it in IPython on its own, with the bare minimum of code? – nnaelle Nov 06 '13 at 21:09
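
A note on the counts quoted in the comments above: `re.findall` counts every occurrence of the string, including closing tags and any other text that happens to contain it, so regex hits and parsed tags are not directly comparable. Below is a minimal sketch using only the standard library. It is written for Python 3, unlike the Python 2 snippets in this thread, and `TableCounter` and `sample` are made-up names for illustration.

```python
import re
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Counts opening <table> tags; HTMLParser lowercases tag names."""

    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables += 1


# Hypothetical markup: one upper-case and one lower-case table.
sample = (
    "<TABLE><TR><TD>a</TD></TR></TABLE>"
    "<table><tr><td>b</td></tr></table>"
)

# The regex matches the tag name in opening AND closing tags.
regex_hits = re.findall("table", sample, flags=re.IGNORECASE)

counter = TableCounter()
counter.feed(sample)

print(len(regex_hits), counter.tables)  # prints: 4 2
```

Here the regex reports 4 hits for 2 tables. That would be consistent with one commenter seeing 46 regex hits where another saw 23 parsed tables on the same page, although this is only a guess about that page.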

1 Answer

Context

  • python 2.x
  • BeautifulSoup HTML parser

Problem

  • BeautifulSoup's findAll does not return all the expected tags, or returns none at all, even though the tag is known to exist in the markup

Solution

  • Try specifying the parser explicitly when constructing the BeautifulSoup object
## BEFORE
soup = BeautifulSoup(webpage)

## AFTER
soup = BeautifulSoup(webpage, "html5lib")

Rationale

  • The target markup may contain malformed HTML, and different parsers recover from malformed markup with varying degrees of success, so the same document can yield different trees depending on the parser.
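
To see whether the parser choice matters for a given file, one option is to build the soup once per parser and compare the results. The following is a rough sketch, not a definitive diagnostic. It assumes Python 3 and Beautiful Soup 4, `count_tables` and `sample` are hypothetical names, and `lxml` and `html5lib` are optional third-party packages that may not be installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_tables(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Map each parser name to the number of <table> tags it yields.

    A parser that is not installed maps to None instead of raising.
    """
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # bs4 raises this when the requested parser is unavailable.
            counts[name] = None
            continue
        # HTML parsers lowercase tag names, so search for "table".
        counts[name] = len(soup.find_all("table"))
    return counts


# Hypothetical markup standing in for the real file contents.
sample = "<html><body><TABLE><TR><TD>cell</TD></TR></TABLE></body></html>"
print(count_tables(sample))
```

If the counts differ between parsers for the real file, the markup is probably malformed in a way that one parser handles better than another.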

dreftymac