Python - Using wildcard string matching to extract float from a website's source code

Question

I'm working on some code using a web scraper in Python.

Given a website's source code, I need to extract relevant data points. The source code looks like this.

</sup>73.00</span> </td> </tr> <tr class="highlight"> <td><span class="data_lbl">Average</span></td> <td> <span class="data_data"><sup>
</sup>86.06</span> </td> </tr> <tr> <td><span class="data_lbl">Current Price</span></td> <td> <span class="data_data"><sup> </sup>83.20</span> </td>
 </tr> </tbody> </table> </div> </div> <!--data-module-name="quotes.module.researchratings.Module"--> </div> <div class="column at8-
col4 at16-col4 at12-col6" id="adCol"> <div intent in-at4units-prepend="#adCol" in-at8units-prepend="#adCol" in-at12units-prepend="#adCol

Here is the regex I'm using:

regex = re.compile('Average*</sup>.....')

It aims to get the 5 characters after the first "</sup>" tag encountered after "Average", which in this case would be "86.06" (although I need to clean up the match before I'm left with just a float).

Is there a more elegant way of doing this that outputs the first float encountered after the string "Average"?

I'm very new to using regex and apologize if the question isn't clear enough.

"Is there a more elegant way of doing this" <- yes, use an HTML parser, don't summon Cthulhu. — timgeb, Dec 12 '17 at 12:55
Possible duplicate of [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — Arne, Dec 12 '17 at 13:00
Possibly answered here https://stackoverflow.com/questions/31638311/beautifulsoup-how-to-extract-text-after-specified-string — kmcodes, Dec 12 '17 at 13:01
[Do. Not. Parse. HTML. With. Regex.](https://stackoverflow.com/a/1732454) — zwer, Dec 12 '17 at 13:43

alseether · Accepted Answer · 2017-12-12T15:22:35.583

1

I've been able to achieve that using lookbehind assertions combined with a non-greedy (lazy) match:

(?<=Average).*?(?<=<\/sup>)([0-9.]{5})


Explanation

([0-9.]{5}): capture 5 characters, each a digit from 0 to 9 or a dot, once the following three conditions are met:
1. (?<=Average): the word Average must appear before
2. .*?: any number of characters in between. Non-greedy (matches as few characters as possible)
3. (?<=<\/sup>): the tag </sup> must appear immediately before

The number you're looking for will be in the first capture group
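As a rough sketch of how this pattern could be used from Python: the `html` string below is the fragment from the question, and the line break inside it is an assumption about how the real page wraps. Because `.` does not match a newline by default, `re.DOTALL` is added here; the second pattern, `flexible`, is a hypothetical variation that is not part of the answer above and simply avoids hard-coding a length of 5.

```python
import re

# Fragment of the page source from the question; the "\n" is an assumed line break.
html = (
    '</sup>73.00</span> </td> </tr> <tr class="highlight"> '
    '<td><span class="data_lbl">Average</span></td> <td> '
    '<span class="data_data"><sup>\n</sup>86.06</span> </td> </tr> <tr> '
    '<td><span class="data_lbl">Current Price</span></td> <td> '
    '<span class="data_data"><sup> </sup>83.20</span> </td>'
)

# Pattern from the answer; re.DOTALL lets "." cross line breaks in the source.
pattern = re.compile(r'(?<=Average).*?(?<=</sup>)([0-9.]{5})', re.DOTALL)

match = pattern.search(html)
average = float(match.group(1)) if match else None
print(average)  # 86.06

# Hypothetical variation: accept a number of any length instead of exactly 5 chars.
flexible = re.compile(r'Average.*?</sup>\s*(\d+(?:\.\d+)?)', re.DOTALL)
flexible_match = flexible.search(html)
flexible_average = float(flexible_match.group(1)) if flexible_match else None
```

Checking `match` before calling `group(1)` avoids an `AttributeError` when "Average" is missing from the page.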

edited Dec 12 '17 at 15:22

answered Dec 12 '17 at 13:26


BTW, I recently saw [this](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) post; I think it is mandatory reading – alseether Dec 13 '17 at 11:39
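For comparison, here is a minimal sketch of the HTML-parser approach the comments recommend, using only the standard library's `html.parser`. The class name `LabelValueParser` and its attributes are made up for illustration, and the sketch assumes every label sits in a `data_lbl` span followed by a purely numeric `data_data` span, as in the fragment from the question.

```python
from html.parser import HTMLParser

# Fragment of the page source from the question; the "\n" is an assumed line break.
html = (
    '</sup>73.00</span> </td> </tr> <tr class="highlight"> '
    '<td><span class="data_lbl">Average</span></td> <td> '
    '<span class="data_data"><sup>\n</sup>86.06</span> </td> </tr> <tr> '
    '<td><span class="data_lbl">Current Price</span></td> <td> '
    '<span class="data_data"><sup> </sup>83.20</span> </td>'
)


class LabelValueParser(HTMLParser):
    """Hypothetical helper: pair each data_lbl text with the next data_data value."""

    def __init__(self):
        super().__init__()
        self.current_class = None   # class of the <span> we are currently inside
        self.pending_label = None   # last label seen that has no value yet
        self.values = {}            # label -> float

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.current_class = dict(attrs).get("class")

    def handle_endtag(self, tag):
        if tag == "span":
            self.current_class = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # skip whitespace-only text such as the content of <sup>
        if self.current_class == "data_lbl":
            self.pending_label = text
        elif self.current_class == "data_data" and self.pending_label:
            # Assumes the value is numeric; float() raises ValueError otherwise.
            self.values[self.pending_label] = float(text)
            self.pending_label = None


parser = LabelValueParser()
parser.feed(html)
parser.close()
print(parser.values["Average"])  # 86.06
```

Unlike the regex, this does not depend on the exact characters between "Average" and the number, only on the span classes.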
