I'm looking to use a regex to extract a quantity from a shopping website. In the following example, I want to get "12.5 kilograms". However, the quantity in the first span is not always listed in kilograms; it could be lbs., oz., etc.

        <td class="size-price last first" colspan="4">
                    <span>12.5 kilograms </span>
            <span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
                    </span>
                </td>

The code above is only a small portion of what is actually extracted using BeautifulSoup. Whatever the page is, the quantity is always within a span and is on a new line after

<td class="size-price last first" colspan="4">  

I've used regex in the past, but I am far from an expert. I'd like to know how to match elements that span several lines, in this case between

<td class="size-price last first" colspan="4">

and

<span> <span class="strike">
LaGuille

1 Answer

Avoid parsing HTML with regex. Use the right tool for the job: an HTML parser such as BeautifulSoup. It is powerful, easy to use, and handles your case well:

from bs4 import BeautifulSoup


data = """
<td class="size-price last first" colspan="4">
                    <span>12.5 kilograms </span>
            <span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
                    </span>
                </td>"""
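# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(data, "html.parser")

# First <span> inside the first <td>
print(soup.td.span.text)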

prints:

12.5 kilograms 
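Note that the text keeps the trailing space from the markup. If the number and the unit are needed separately, a small regex applied to the extracted string (not to the HTML) is one option. The following is only a sketch: `split_quantity` and `QUANTITY_RE` are hypothetical names, and the pattern assumes the quantity is a plain integer or decimal followed by a unit.

```python
import re

# Assumed shape: optional whitespace, a number such as "12.5" or "5",
# then the unit ("kilograms", "lbs.", "oz.", ...), then optional whitespace.
QUANTITY_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(.+?)\s*$")


def split_quantity(text):
    """Return (value, unit) for strings like '12.5 kilograms ', else None."""
    match = QUANTITY_RE.match(text)
    if match is None:
        return None
    return float(match.group(1)), match.group(2)


print(split_quantity("12.5 kilograms "))  # (12.5, 'kilograms')
```

If only the cleaned-up string is needed, `soup.td.span.text.strip()` is enough.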

Or, if the td is a part of a bigger structure, find it by class and get the first span's text out of it:

print(soup.find('td', {'class': 'size-price'}).span.text)

UPD (handling multiple results):

print([td.span.text for td in soup.find_all('td', {'class': 'size-price'})])
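`find_all` returns a `ResultSet`, which behaves like a list of tags, so `.span` has to be taken from each element rather than from the result as a whole. Here is a self-contained sketch of the same idea. The second `td` (5 lbs., $99.00) is made-up sample data, and `quantities` is just an illustrative name.

```python
from bs4 import BeautifulSoup

# Sample markup: the first cell is from the question, the second is invented
html = """
<table><tr>
<td class="size-price last first" colspan="4">
    <span>12.5 kilograms </span>
    <span> <span class="strike">$619.06</span> <span class="price">$523.91</span> </span>
</td>
<td class="size-price last" colspan="4">
    <span>5 lbs. </span>
    <span> <span class="price">$99.00</span> </span>
</td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

quantities = []
for td in soup.find_all("td", class_="size-price"):
    first_span = td.find("span")  # first <span> inside this cell
    if first_span is not None:    # skip cells without a span
        quantities.append(first_span.get_text(strip=True))

print(quantities)  # ['12.5 kilograms', '5 lbs.']
```

`get_text(strip=True)` drops the surrounding whitespace, and the `None` check avoids an `AttributeError` on cells that contain no span.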

Hope that helps.

alecxe
  • Thanks. Some pages contain more than one sizing option, resulting in multiple ` ...` With your code, I can print out the first sizing option appearing on the page. However, if I use `print soup.find_all('td', {'class': 'size-price'}).span.text` I get: `AttributeError: 'ResultSet' object has no attribute 'span'` – LaGuille Mar 25 '14 at 03:48