
I am completely puzzled by the following HTML-scraping code, which behaves differently in two environments, and I need help finding the root cause of the discrepancy.

import sys
import bs4
import md5
import logging
from urllib2 import urlopen
from platform import platform

# Log particulars of the environment
logging.warning("OS platform is %s" %platform())
logging.warning("Python version is %s" %sys.version)
logging.warning("BeautifulSoup is at %s and its version is %s" %(bs4.__file__, bs4.__version__))

# Open web-page and read HTML
url = 'http://www.ncbi.nlm.nih.gov/Traces/wgs/?val=JXIG&size=all'
response = urlopen(url)
html = response.read()

# Calculate MD5 to ensure that the same string was downloaded
print "MD5 sum for html string downloaded is %s" %md5.new(html).hexdigest()

# Make beautiful soup
soup = bs4.BeautifulSoup(html, 'html')
contigsTable = soup.find("table", {"class" : "zebra"})
contigs = []

# Parse table in soup to find all records
for row in contigsTable.findAll('tr'):
    column = row.findAll('td')
    if len(column) > 2:
        contigs.append(column[1])

# Expect identical results on any machine that this is run
print "Number of contigs identified is %s" %len(contigs)

On machine 1, this runs to return:

WARNING:root:OS platform is Linux-3.10.10-031010-generic-x86_64-with-Ubuntu-12.04-precise   
WARNING:root:Python version is 2.7.3 (default, Jun 22 2015, 19:33:41)  
[GCC 4.6.3]  
WARNING:root:BeautifulSoup is at /usr/local/lib/python2.7/dist-packages/bs4/__init__.pyc and its version is 4.3.2  
MD5 sum for html string downloaded is ca76b381df706a2d6443dd76c9d27adf  

Number of contigs identified is 630  

On machine 2, the identical code runs to return:

WARNING:root:OS platform is Linux-2.6.32-431.46.2.el6.nersc.x86_64-x86_64-with-debian-6.0.6
WARNING:root:Python version is 2.7.4 (default, Apr 17 2013, 10:26:13) 
[GCC 4.6.3]
WARNING:root:BeautifulSoup is at /global/homes/i/img/.local/lib/python2.7/site-packages/bs4/__init__.pyc and its version is 4.3.2
MD5 sum for html string downloaded is ca76b381df706a2d6443dd76c9d27adf

Number of contigs identified is 462

The number of contigs differs. The same code parses the same HTML (the MD5 sums match) yet yields different results in two environments that are not strikingly different from each other, which has turned into a production nightmare. Manual inspection confirms that the result on machine 2 is incorrect, but I have so far been unable to explain why.

Does anyone have similar experience? Do you notice anything wrong with this code or should I stop trusting BeautifulSoup altogether?

Spade

1 Answer


You are seeing the differences between the parsers that BeautifulSoup chooses automatically for the "html" markup type you've specified. Which parser gets picked depends on which modules are available in the current Python environment:

If you don’t specify anything, you’ll get the best HTML parser that’s installed. Beautiful Soup ranks lxml’s parser as being the best, then html5lib’s, then Python’s built-in parser.
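
The ranking in that quote can be turned into a quick environment check. The following is only a rough sketch: `installed` and `likely_default_parser` are hypothetical helpers written for this illustration, not part of BeautifulSoup, and they merely mirror the documented order (lxml, then html5lib, then the built-in parser) rather than calling into the library. It uses only the standard library and should run on both Python 2 and Python 3.

```python
def installed(module_name):
    """Return True if module_name can be imported in this environment."""
    try:
        __import__(module_name)
        return True
    except ImportError:
        return False


def likely_default_parser(is_installed=installed):
    """Mirror the documented ranking: lxml, then html5lib, then built-in.

    Hypothetical helper: it only guesses which parser an unspecified
    request would probably resolve to; it does not ask BeautifulSoup.
    """
    for module_name in ("lxml", "html5lib"):
        if is_installed(module_name):
            return module_name
    return "html.parser"


print("Parser most likely picked when none is named: %s" % likely_default_parser())
```

Running this on both machines would show whether they differ in which parser libraries are installed, which is the most probable reason for the different row counts.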

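To get consistent behavior across platforms, name the parser explicitly:

# Pick exactly one of these and use it on every machine:
soup = BeautifulSoup(html, "html.parser")  # Python's built-in parser, nothing extra to install
soup = BeautifulSoup(html, "html5lib")     # requires the html5lib package
soup = BeautifulSoup(html, "lxml")         # requires the lxml package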

See also: Installing a parser.
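
Applied to the table-scraping loop from the question, pinning the parser might look like the sketch below. The function name `extract_contigs`, the sample markup, and the contig names are all made up for illustration. The built-in "html.parser" is used only because it needs no extra install; whether it gives the right count on the real NCBI page is something you would have to check by hand.

```python
from bs4 import BeautifulSoup

# Made-up, well-formed sample standing in for the real page
SAMPLE_HTML = """
<table class="zebra">
  <tr><td>1</td><td>contig_A</td><td>1200</td></tr>
  <tr><td>2</td><td>contig_B</td><td>950</td></tr>
  <tr><td>summary row with a single cell</td></tr>
</table>
"""


def extract_contigs(html, parser="html.parser"):
    """Return the text of the second cell of every row with more than two cells."""
    # The parser is named explicitly so every machine parses the same way
    soup = BeautifulSoup(html, parser)
    table = soup.find("table", {"class": "zebra"})
    if table is None:
        raise ValueError("no table with class 'zebra' found")
    contigs = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) > 2:
            contigs.append(cells[1].get_text(strip=True))
    return contigs


print("Number of contigs identified is %d" % len(extract_contigs(SAMPLE_HTML)))
```

If the named parser is not installed, BeautifulSoup raises an error instead of silently falling back to another one, so a missing dependency shows up immediately rather than as a wrong count.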

alecxe
  • Thanks for this answer. Your explanation is very sound and consistent with the documentation. What I am struggling to understand is how any parser could mess up something as basic as counting the number of rows in an HTML table. The behavior is also not consistent across different documents of the same type, i.e. there are many cases in which the contig counts agree, which let this go undetected. Thanks so much! – Spade Sep 18 '15 at 05:55
  • @Spade yes, with non-well-formed HTML this is basically a trial-and-error game: some parsers are more lenient than others, and they interpret the broken HTML differently. You should probably choose the parser that extracts the desired results and stick to it if possible. There are other approaches to tackling broken and inconsistent HTML, though, such as letting a real browser render the page and "fix" the markup before proceeding to data extraction. Thanks! – alecxe Sep 18 '15 at 06:13
  • I have a parser, but the behavior is still inconsistent. I had to use a while loop and parse it more than once to get consistent behavior. – Denise Michelle del Bando Jun 04 '21 at 15:48