Using REGEX to match elements between lines in Python

Question

I'm looking to use REGEX to extract a quantity from a shopping website. In the following example, I want to get "12.5 kilograms". However, the quantity in the first span is not always given in kilograms; it could be lbs., oz., etc.

        <td class="size-price last first" colspan="4">
                    <span>12.5 kilograms </span>
            <span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
                    </span>
                </td>

The code above is only a small portion of what is actually extracted using BeautifulSoup. Whatever the page is, the quantity is always within a span and is on a new line after

<td class="size-price last first" colspan="4">

I've used REGEX in the past but I am far from an expert. I'd like to know how to match elements between different lines. In this case between

<td class="size-price last first" colspan="4">

and

<span> <span class="strike">

This question appears to be off-topic because it is about parsing html with regex. — Hyperboreus, Mar 25 '14 at 03:38

score 1 · Accepted Answer · edited May 23 '17 at 10:26

Avoid parsing HTML with regex. Use the tool for the job, an HTML parser, like BeautifulSoup - it is powerful, easy to use and it can perfectly handle your case:

from bs4 import BeautifulSoup


data = """
<td class="size-price last first" colspan="4">
                    <span>12.5 kilograms </span>
            <span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
                    </span>
                </td>"""
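# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(data, "html.parser")

print(soup.td.span.text)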

prints:

12.5 kilograms

Or, if the td is a part of a bigger structure, find it by class and get the first span's text out of it:

print(soup.find('td', {'class': 'size-price'}).span.text)
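
A slightly more defensive sketch of the same lookup may be useful: `find` returns `None` when nothing matches, and the span's text keeps the trailing whitespace from the markup, so it can help to guard and strip. The helper name `first_quantity` and the surrounding `<table>` markup are made up for illustration; `html.parser` is chosen only because it ships with Python.

```python
from bs4 import BeautifulSoup

html = """
<table><tr>
<td class="size-price last first" colspan="4">
    <span>12.5 kilograms </span>
    <span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
    </span>
</td>
</tr></table>"""


def first_quantity(markup):
    """Return the stripped text of the first span in the first size-price cell, or None."""
    soup = BeautifulSoup(markup, "html.parser")
    # class_="size-price" matches even though the attribute is "size-price last first":
    # bs4 treats class as a multi-valued attribute and matches any one of its values.
    td = soup.find("td", class_="size-price")
    if td is None or td.span is None:
        return None
    return td.span.get_text(strip=True)


print(first_quantity(html))               # 12.5 kilograms
print(first_quantity("<p>no table</p>"))  # None
```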

UPD (handling multiple results):

print([td.span.text for td in soup.find_all('td', {'class': 'size-price'})])
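
If a regex is truly unavoidable (say, bs4 cannot be installed), matching across lines comes down to consuming the whitespace, newlines included, explicitly with `\s*`, or passing `re.DOTALL` so that `.` also matches newlines. The following is only a sketch against the exact markup in the question; the pattern and the name `QUANTITY_RE` are assumptions, and it will break on nested tags or reordered attributes in ways the parser approach will not.

```python
import re

data = """
<td class="size-price last first" colspan="4">
                    <span>12.5 kilograms </span>
            <span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
                    </span>
                </td>"""

# \s* absorbs the newline and indentation between the <td ...> tag and the first <span>.
# [^>]* skips the remaining attributes of the td; ([^<]*?) captures the span's text,
# and the trailing \s* keeps whitespace before </span> out of the captured group.
QUANTITY_RE = re.compile(
    r'<td\s+class="size-price[^"]*"[^>]*>\s*<span>\s*([^<]*?)\s*</span>'
)

quantities = QUANTITY_RE.findall(data)
print(quantities)  # ['12.5 kilograms']
```

Because `findall` returns every match, the same pattern also covers pages with several sizing options, as long as each cell follows the same layout.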

Hope that helps.

Thanks. Some pages contain more than one sizing option, resulting in multiple ` ...` With your code, I can print out the first sizing option appearing on the page. However, if I use `print soup.find_all('td', {'class': 'size-price'}).span.text` I get: `AttributeError: 'ResultSet' object has no attribute 'span'` — LaGuille, Mar 25 '14 at 03:48
