HTML parsing with Python regular expressions

Question

I am using Python regular expressions to parse an HTML file. I need to extract a number from an HTML tag; the number can be either an integer or a floating-point value. Here are two examples:

integer case:

<span class='addr-bbs'>2 baths</span>

floating point case:

<span class='addr-bbs'>3.5 baths</span>

My original code is:

bath = re.findall('<span class=\"addr_bbs\">' + '(.{1,3})' + 'baths{0,1}<', str(homedata))

But after testing, it misses all the floating point cases. How can I cover both cases to extract the number correctly?

Thanks

Please don't parse HTML with regex, it's gonna hurt you. You're using Python already, why not use BeautifulSoup? https://www.crummy.com/software/BeautifulSoup/bs4/doc/ — 1sloc, Jul 11 '16 at 19:50
Possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — Two-Bit Alchemist, Jul 11 '16 at 19:51

Padraic Cunningham · Accepted Answer · 2016-07-11T20:01:49.230

As commented, use an HTML parser to find the tags by class name. If the number is always the first token in the text, you can just split to extract it once you have the tag:

from bs4 import BeautifulSoup
h = """<span class='addr-bbs'>3.5 baths</span>
      <span class='addr-bbs'>1 baths</span>
      <span class='foos'>3.0 baths</span>"""

soup = BeautifulSoup(h, "html.parser")

for span in soup.select("span.addr-bbs"):
    print(span.text.split()[0])

Which would print:

3.5
1

If you also want to filter by the tag text, i.e. there are other spans with the addr-bbs class, you can pass a regex to find_all to get only the span.addr-bbs tags that contain the word baths:

from bs4 import BeautifulSoup
import re

h = """<span class='addr-bbs'>3.5 baths</span>
      <span class='addr-bbs'>5 rooms</span>
      <span class='addr-bbs'>1 baths</span>
      <span class='foos'>3.0 baths</span>"""

soup = BeautifulSoup(h, "html.parser")

# Keep only span.addr-bbs tags whose text contains the word "baths".
# (Newer BeautifulSoup releases name this argument string= instead of text=.)
for span in soup.find_all("span", "addr-bbs", text=re.compile(r"\bbaths\b")):
    print(span.text.split()[0])
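
If you need the value as a number rather than as a string, something along these lines might work on each span.text. This is only a sketch: parse_bath_count and BATHS_RE are hypothetical names, not part of BeautifulSoup, and the pattern assumes the count is written as a plain integer or decimal directly before "bath" or "baths".

```python
import re

# Hypothetical helper pattern: an integer or decimal followed by "bath"/"baths".
BATHS_RE = re.compile(r"(\d+(?:\.\d+)?)\s*baths?\b")


def parse_bath_count(text):
    """Return the bath count found in `text` as a float, or None if absent."""
    match = BATHS_RE.search(text)
    return float(match.group(1)) if match else None


# Stand-ins for the span.text values produced by the loops above.
for text in ["3.5 baths", "1 baths", "5 rooms", "2 bath"]:
    print(text, "->", parse_bath_count(text))
```

Returning None for non-matching text keeps the "5 rooms" case from raising, so the caller can decide whether to skip or log it.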

You are probably right, regex may not be a good idea in the long term. I need to redo the whole thing with BeautifulSoup. — DQI, Jul 12 '16 at 16:24

logi-kal · Answer 2 · 2016-07-11T20:04:38.163

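Three typos in your pattern:

the inverted commas: the HTML uses single quotes around the class name, but the pattern expects double quotes;
the dash: the class is addr-bbs, not addr_bbs;
the space: there is a space between the number and "baths". Without it in the pattern, (.{1,3}) has to absorb the space, which leaves no room for "3.5".

Try with bath = re.findall('''<span class=["']addr-bbs["']>''' + '(.{1,3})' + ' baths{0,1}<', str(homedata))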
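
As a quick check, the sketch below runs that pattern against the two spans from the question. The homedata string is a stand-in built from those examples, and the second, stricter pattern (digits with an optional decimal part) is my own assumption about what the number looks like, not something the page guarantees.

```python
import re

# Stand-in input: the two example spans from the question.
homedata = """<span class='addr-bbs'>2 baths</span>
<span class='addr-bbs'>3.5 baths</span>"""

# The suggested pattern: either quote style, a dash in the class name,
# and a literal space before "baths".
loose = re.findall(
    r'''<span class=["']addr-bbs["']>(.{1,3}) baths{0,1}<''', homedata
)
print(loose)   # ['2', '3.5']

# Assumed stricter variant: capture only an integer or decimal number.
strict = re.findall(
    r'''<span class=["']addr-bbs["']>(\d+(?:\.\d+)?) baths?<''', homedata
)
print(strict)  # ['2', '3.5']
```

The stricter group avoids capturing arbitrary three-character text, at the cost of missing values that are not plain numbers.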

edited Jul 11 '16 at 20:04

answered Jul 11 '16 at 19:59


score 0 · Answer 3 · answered Jul 11 '16 at 20:01

First, realize you are somewhat doomed without more processing. Some realtors will write "2.5", others "2 1/2", others "2+1/2", and so on. MLS data has never been normalized, in part to make it difficult to parse. Just when you think you have it solved, you get "2+sink". It's usually permissible to guess the numeric meaning for searches and then spit out the original text when it's displayed.
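
A rough sketch of that "guess the number, keep the original text" idea might look like the following. The function name guess_bath_count and the set of formats it handles are assumptions for illustration; real listing data will contain variants this does not cover.

```python
import re
from fractions import Fraction


def guess_bath_count(raw):
    """Best-effort numeric guess for a bath description, or None."""
    text = raw.strip()

    # "2 1/2", "2+1/2", or a bare "1/2": optional whole part plus a fraction.
    m = re.fullmatch(r"(?:(\d+)\s*[+ ]\s*)?(\d+)/(\d+)", text)
    if m:
        whole, num, den = m.groups()
        if int(den) == 0:
            return None
        return float(int(whole or 0) + Fraction(int(num), int(den)))

    # "2", "2.5", or "2+sink": fall back to the leading number, if any.
    m = re.match(r"\d+(?:\.\d+)?", text)
    return float(m.group()) if m else None


# Keep the original text for display alongside the guessed value for searches.
for raw in ["2.5", "2 1/2", "2+1/2", "2+sink", "n/a"]:
    print(repr(raw), "->", guess_bath_count(raw))
```

Anything that comes back as None is a candidate for manual review rather than a silent default.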

You should probably grab everything from the > to baths. To do this correctly, you should use the "non-greedy" modifier, so that you don't parse all the way down to the next record. You can read about non-greedy matching in the Python re documentation, but the magic phrase is:

bath = re.findall(r'''<span class=["']addr-bbs["']>(.*?)bath.?<''', str(homedata))

Then parse each string in bath as best you can (re.findall returns a list of the captured strings, so there is no groups() to call).
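
To see why the non-greedy form matters, here is a small comparison. The homedata string is a made-up stand-in with two records on a single line (since . does not cross newlines by default, the problem only shows up when records share a line), and the class name and quoting follow the example spans in the question.

```python
import re

# Stand-in input: two records on one line, as in minified HTML.
homedata = ("<span class='addr-bbs'>2 baths</span>"
            "<span class='addr-bbs'>3.5 baths</span>")

prefix = r"""<span class=["']addr-bbs["']>"""

# Greedy: .* runs to the last "bath" on the line, swallowing both records.
greedy = re.findall(prefix + r"(.*)bath.?<", homedata)
print(greedy)  # ["2 baths</span><span class='addr-bbs'>3.5 "]

# Non-greedy: .*? stops at the first "bath" after each opening tag.
lazy = re.findall(prefix + r"(.*?)bath.?<", homedata)
print(lazy)    # ['2 ', '3.5 ']

# The captures keep the trailing space, so strip before further parsing.
print([s.strip() for s in lazy])  # ['2', '3.5']
```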
