I am using BeautifulSoup 4 with Python 2.7. I would like to extract certain elements from a website (quantities, see the example below). For some reason, the lxml parser doesn't let me extract all of the desired elements from the page: it prints only the first three. I am trying the html5lib parser to see if it can extract all of them.

The page contains multiple items, each with a price and a quantity. The portion of the HTML containing the desired information for each item looks like this:

<td class="size-price last first" colspan="4">
                    <span>453 grams </span>
            <span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
                    </span>
                </td>

Let's consider the following three cases:

CASE 1 - DATA:

#! /usr/bin/python
from bs4 import BeautifulSoup
data = """
<td class="size-price last first" colspan="4">
                    <span>453 grams </span>
            <span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
                    </span>
                </td>"""                
soup = BeautifulSoup(data)
print soup.td.span.text

Prints:

453 grams 

CASE 2 - LXML:

#! /usr/bin/python
from bs4 import BeautifulSoup
from urllib import urlopen
webpage = urlopen('The URL goes here')
soup=BeautifulSoup(webpage, "lxml")
print soup.find('td', {'class': 'size-price'}).span.text

Prints:

453 grams

CASE 3 - HTML5LIB:

#! /usr/bin/python
from bs4 import BeautifulSoup
from urllib import urlopen
webpage = urlopen('The URL goes here')
soup=BeautifulSoup(webpage, "html5lib")
print soup.find('td', {'class': 'size-price'}).span.text

I get the following error:

Traceback (most recent call last):
  File "C:\Users\Dom\Python-Code\src\Testing-Code.py", line 6, in <module>
    print soup.find('td', {'class': 'size-price'}).span.text
AttributeError: 'NoneType' object has no attribute 'span'

How do I have to adapt my code to extract the information I want using the html5lib parser? If I simply print the soup in the console after parsing with html5lib, I can see all of the desired information, so I figured it would let me get what I want. That is not the case with the lxml parser, so I am also curious why lxml doesn't seem to extract all of the quantities when I use:

print [td.span.text for td in soup.find_all('td', {'class': 'size-price'})]
LaGuille
  • `html5lib` omits the `td` tag and puts everything inside the html body. This is because there is no `table` tag around the `td`, which `html5lib` treats as invalid. – alecxe Mar 27 '14 at 19:36
  • Interesting, so now how should I proceed in order to extract the elements that I want using html5lib – LaGuille Mar 27 '14 at 20:02
  • Well, why do you want to use `html5lib`? FYI, you can also make use of `html.parser`, e.g.: `BeautifulSoup(webpage, 'html.parser')`. – alecxe Mar 27 '14 at 20:07
  • RuntimeWarning: Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help. – LaGuille Mar 27 '14 at 20:12
  • Gotcha :) Can I see the whole html document (or link) you are trying to parse? – alecxe Mar 27 '14 at 20:13
  • I'd rather keep it private if you don't mind. I know it is harder for you to help if I don't give you the specific URL... Let's say I use a regex to collect the prices: `soup = BeautifulSoup(webpage, "lxml"); FindPrice = re.compile('\$\d+\.\d{2}'); print re.findall(FindPrice, str(soup))`. It won't print all of the prices on the page. Using html5lib will. Same thing with the quantities. This is why I want to use html5lib. I'm just looking for a way to use BeautifulSoup with html5lib like I do with lxml, since scraping with regex isn't recommended... – LaGuille Mar 27 '14 at 20:21
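
alecxe's first comment can be checked against the snippet from the question. The following is only a sketch, assuming `html5lib` is installed and that the bare fragment is what gets parsed; the real page is not available, so it may behave differently. The names `fragment`, `bare`, `wrapped` and `quantities` are made up for illustration.

```python
from bs4 import BeautifulSoup

# The fragment from the question: a bare <td> with no enclosing <table>.
fragment = """
<td class="size-price last first" colspan="4">
    <span>453 grams </span>
    <span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
    </span>
</td>"""

# html5lib follows the HTML5 parsing rules, which ignore a <td> start tag
# that appears outside a table, so the cell itself never reaches the tree.
bare = BeautifulSoup(fragment, "html5lib")
bare_td = bare.find("td", class_="size-price")  # None, hence the AttributeError

# Wrapping the fragment in <table><tr> gives the parser a valid context.
wrapped = BeautifulSoup("<table><tr>%s</tr></table>" % fragment, "html5lib")

# find_all returns an empty list (not None) when nothing matches, and the
# extra check skips cells that contain no <span> at all.
quantities = [
    td.span.get_text(strip=True)
    for td in wrapped.find_all("td", class_="size-price")
    if td.span is not None
]
print(quantities)
```

If the live page already has its cells inside a proper table, the `find_all` loop with the `None` check should work on it unchanged; if `find` still returns `None` there, the cells are probably being dropped for the same reason as in `bare`.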

2 Answers

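from lxml import etree

html = 'your html'  # placeholder: the page source as a string
tree = etree.HTML(html)
# Select the cells whose class attribute is exactly "size-price last first"
tds = tree.xpath('.//td[@class="size-price last first"]')
for td in tds:
    price = td.xpath('.//span[@class="price"]')[0].text
    strike = td.xpath('.//span[@class="strike"]')[0].text
    spans = td.xpath('.//span')
    # .text is None for a span that starts with a child element, so guard the test
    quantity = [i.text for i in spans if i.text and 'grams' in i.text][0].strip()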
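
One caveat, sketched below: the XPath above compares the whole `class` attribute, whereas `find('td', {'class': 'size-price'})` in the question matches on a single class. If some rows on the real page carry a different class list (an assumption, since the page is not shown), the exact comparison would skip them. The second row in this sample, and the names `exact`, `token` and `quantities`, are invented for illustration.

```python
from lxml import etree

# Hypothetical markup: the second row's class list is invented for illustration.
html = """
<table>
  <tr><td class="size-price last first"><span>453 grams </span></td></tr>
  <tr><td class="size-price"><span>907 grams </span></td></tr>
</table>"""
tree = etree.HTML(html)

# Exact comparison against the whole attribute value.
exact = tree.xpath('.//td[@class="size-price last first"]')

# Token match: pad the normalised class list with spaces, then look for
# " size-price " so that other classes on the same cell do not matter.
token = tree.xpath(
    './/td[contains(concat(" ", normalize-space(@class), " "), " size-price ")]'
)

# findtext returns the text of the first child <span> of each cell.
quantities = [td.findtext("span").strip() for td in token]
print("%d exact, %d token" % (len(exact), len(token)))
print(quantities)
```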
AutomaticStatic

Try the below:

    from bs4 import BeautifulSoup
    data = """
    <td class="size-price last first" colspan="4">
                <span>453 grams </span>
            <span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
                </span>
            </td>"""                
    soup = BeautifulSoup(data)
    text = soup.get_text(strip=True)
    print text
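
Note that `get_text(strip=True)` returns every string in the fragment joined with no separator, so the quantity and both prices run together. A rough sketch of pulling the fields out separately follows; `html.parser` is used here only to keep the example free of extra dependencies (the asker reported that it could not handle the full page), and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

data = """
<td class="size-price last first" colspan="4">
    <span>453 grams </span>
    <span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
    </span>
</td>"""

# Assumption: html.parser copes with this small fragment; swap in "lxml"
# or another parser for the real page.
soup = BeautifulSoup(data, "html.parser")

joined = soup.get_text(strip=True)       # pieces concatenated with no separator
spaced = soup.get_text(" ", strip=True)  # same pieces, joined by a space

# Address each field by tag and class instead of splitting the joined text.
td = soup.find("td", class_="size-price")
quantity = td.span.get_text(strip=True)
strike = td.find("span", class_="strike").get_text(strip=True)
price = td.find("span", class_="price").get_text(strip=True)

print(joined)
print(spaced)
print("%s | %s | %s" % (quantity, strike, price))
```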