I'm working on some code using a web scraper in Python.

Given a website's source code, I need to extract relevant data points. The source code looks like this:

</sup>73.00</span> </td> </tr> <tr class="highlight"> <td><span class="data_lbl">Average</span></td> <td> <span class="data_data"><sup>
</sup>86.06</span> </td> </tr> <tr> <td><span class="data_lbl">Current Price</span></td> <td> <span class="data_data"><sup> </sup>83.20</span> </td>
 </tr> </tbody> </table> </div> </div> <!--data-module-name="quotes.module.researchratings.Module"--> </div> <div class="column at8-
col4 at16-col4 at12-col6" id="adCol"> <div intent in-at4units-prepend="#adCol" in-at8units-prepend="#adCol" in-at12units-prepend="#adCol

Here is the regex I'm using:

regex = re.compile('Average*</sup>.....')

It aims to get the 5 characters after the first "</sup>" tag encountered after "Average", which in this case would be "86.06" (although I still need to clean up the match before I'm left with just a float).

Is there a more elegant way of doing this, one that outputs the first float encountered after the string "Average"?

I'm very new to using regex and apologize if the question isn't clear enough.

ilim
td736
  • "Is there a more elegant way of doing this" <- yes, use an HTML parser, don't summon Cthulhu. – timgeb Dec 12 '17 at 12:55
  • Possible duplicate of [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Arne Dec 12 '17 at 13:00
  • Possibly answered here https://stackoverflow.com/questions/31638311/beautifulsoup-how-to-extract-text-after-specified-string – kmcodes Dec 12 '17 at 13:01
  • [Do. Not. Parse. HTML. With. Regex.](https://stackoverflow.com/a/1732454) – zwer Dec 12 '17 at 13:43
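
The parser-based route the comments recommend could look roughly like the sketch below. It uses only the standard library's `html.parser`; the class name `LabelValueParser` is made up for illustration, and it assumes every label sits in a `data_lbl` span followed by its value in a `data_data` span, as in the snippet from the question.

```python
from html.parser import HTMLParser


class LabelValueParser(HTMLParser):
    """Hypothetical helper: maps each data_lbl text to the data_data value after it."""

    def __init__(self):
        super().__init__()
        self.values = {}     # label text -> float value
        self._kind = None    # class of the span currently being read, if any
        self._label = None   # most recent label seen
        self._buffer = []    # text fragments collected inside the current span

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("data_lbl", "data_data"):
            self._kind = cls
            self._buffer = []

    def handle_data(self, data):
        if self._kind:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._kind:
            text = "".join(self._buffer).strip()
            if self._kind == "data_lbl":
                self._label = text
            elif self._label is not None:
                # Assumes the value is a plain decimal number such as 86.06.
                self.values[self._label] = float(text)
            self._kind = None


# Sample modelled on the question's snippet; the real page may differ.
html = (
    '<table><tbody><tr class="highlight"> '
    '<td><span class="data_lbl">Average</span></td> <td> '
    '<span class="data_data"><sup>\n</sup>86.06</span> </td> </tr> <tr> '
    '<td><span class="data_lbl">Current Price</span></td> <td> '
    '<span class="data_data"><sup> </sup>83.20</span> </td> </tr> </tbody> </table>'
)

parser = LabelValueParser()
parser.feed(html)
parser.close()
print(parser.values["Average"])
```

Because the lookup is keyed on the label text, the same pass also yields "Current Price" without a second pattern.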

1 Answer

I've been able to achieve that using lookbehind assertions combined with a non-greedy match:

(?<=Average).*?(?<=<\/sup>)([0-9.]{5})

A working example was linked here.

Explanation

  • ([0-9.]{5}): captures 5 characters, each a digit from 0 to 9 or a dot, once the following three conditions are met:

    1. (?<=Average): the word Average must appear immediately before the match
    2. .*?: any number of characters in between, non-greedy (matches as few characters as possible)
    3. (?<=<\/sup>): the tag </sup> must appear immediately before the number

The number you're looking for will be in the first capture group.
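
In Python, using the pattern above might look like the sketch below. The sample string is copied from the question; `re.DOTALL` is added on the assumption that a line break can fall between "Average" and the number, since "." does not match newlines otherwise. The second pattern, `flexible`, is a variation of mine rather than part of the original answer: it swaps the fixed five characters for `\d+\.\d+` so values like 7.5 or 1234.56 would also match.

```python
import re

# Sample copied from the question; the real page is assumed to have the same shape.
html = (
    '<td><span class="data_lbl">Average</span></td> <td> '
    '<span class="data_data"><sup>\n</sup>86.06</span> </td> </tr> <tr> '
    '<td><span class="data_lbl">Current Price</span></td> <td> '
    '<span class="data_data"><sup> </sup>83.20</span> </td>'
)

# The pattern from the answer, unchanged apart from the DOTALL flag.
original = re.compile(r"(?<=Average).*?(?<=<\/sup>)([0-9.]{5})", re.DOTALL)

# Variation (an assumption, not from the answer): any decimal number after </sup>.
flexible = re.compile(r"Average.*?</sup>\s*(\d+\.\d+)", re.DOTALL)

match = flexible.search(html)
average = float(match.group(1)) if match else None  # None when nothing matches
print(average)
print(original.search(html).group(1))
```

Taking `group(1)` and passing it to `float()` removes the clean-up step mentioned in the question.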

alseether
  • BTW, I recently saw [this](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) post; I think it's mandatory reading. – alseether Dec 13 '17 at 11:39