How can I use regular expression/Python to find all integers after a known string, an unknown string, and another known string?

Question

I'm new to regular expressions and Python, but I'm trying to extract a revision number from an HTML page. I used a proxy and urllib to read the page into a string. I have some text that looks like:

<p>Proxy 3.2.1 r72440<br>
SlotBios 11.00</p>
<p><strong><span style="color: rgb(255, 0, 0);">Random Text 4.23.6 r98543<br>
...</tr>...
<p><strong><span style="color: rgb(255, 0, 0);">Random Text 4.33.6 r98549<br>

I want to parse the text and extract the revision numbers from the lines rendered in red. So in this example, I want to parse out 98543 and 98549.

I'm able to parse out all the lines generally with:

paragraphs = re.findall(r'r(\d*)<br>',str(html))

However, I'm a little stuck on how to do it such that I can find only the red lines. My current code would also include 72440. Any idea how to get around this? Thanks!

Do the revision numbers always have the same number of characters? Maybe you should try to cut the string from the back instead. — LordNeo, Jun 08 '16 at 20:34
The regex engine is colourblind. It can't tell which colour your lines would be rendered in a web browser. Is there some other clue you can use to identify the numbers you are looking for? — Håken Lid, Jun 08 '16 at 20:35
[don't use regex to parse html](http://stackoverflow.com/a/1732454/5323213) — R Nar, Jun 08 '16 at 20:40
The revision numbers could have 5 or 6 characters. I'm a little limited to libraries included with a Linux installation, so I haven't been able to use BeautifulSoup. I have been trying (and failing) with the second answer though, but I'll keep working along those lines. I also realized I might be able to cut out all the lines including rgb and then do a simple re findall statement. Thanks for all the responses! — Varun Behl, Jun 09 '16 at 14:09

score 1 · Answer 1 · answered Jun 08 '16 at 20:42

You need to use an HTML parser to help you filter out the tags that have the red color applied, then use your regular expression on each tag's contents:

>>> from bs4 import BeautifulSoup
>>> html = ''' (your html here) '''
>>> parser = BeautifulSoup(html, 'html.parser')
>>> for span_tag in parser.find_all('span', style='color: rgb(255, 0, 0);'):
...     print(span_tag.text)

Random Text 4.23.6 r98543

You can then collect all the text, and run your regular expression over it to filter out the version numbers:

>>> t = [i.text for i in parser.find_all('span', style='color: rgb(255, 0, 0);')]
>>> revisions = re.findall(r'r(\d+)', ' '.join(t))  # assumes `import re`, as in the question
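
Since the asker mentions in the comments that BeautifulSoup may not be installed, here is a rough sketch of the same two-step idea (filter the red spans with a parser, then run a regular expression over their text) using only the standard library's `html.parser`. The names `RedSpanCollector`, `red_revisions`, and `sample_html` are made up for illustration, and the sketch assumes the style attribute matches `color: rgb(255, 0, 0);` exactly and that the revision appears before the first `<br>` after the span opens, as in the question's sample:

```python
import re
from html.parser import HTMLParser

# Style value copied from the question's HTML.
RED_STYLE = "color: rgb(255, 0, 0);"


class RedSpanCollector(HTMLParser):
    """Hypothetical helper: collects text found inside red <span> tags."""

    def __init__(self):
        super().__init__()
        self.in_red = False  # True while inside a red span
        self.chunks = []     # text fragments seen while in_red is True

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            # Entering any span resets the flag (nested spans are not tracked).
            self.in_red = dict(attrs).get("style") == RED_STYLE
        elif tag == "br":
            # Assumption: the sample never closes its spans, so <br> ends the red text.
            self.in_red = False

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_red = False

    def handle_data(self, data):
        if self.in_red:
            self.chunks.append(data)


def red_revisions(html):
    """Return the digits of every r<digits> token found inside red spans."""
    collector = RedSpanCollector()
    collector.feed(html)
    collector.close()
    found = []
    for chunk in collector.chunks:
        found.extend(re.findall(r"\br(\d+)", chunk))
    return found


# Sample text taken from the question.
sample_html = '''<p>Proxy 3.2.1 r72440<br>
SlotBios 11.00</p>
<p><strong><span style="color: rgb(255, 0, 0);">Random Text 4.23.6 r98543<br>
...</tr>...
<p><strong><span style="color: rgb(255, 0, 0);">Random Text 4.33.6 r98549<br>
'''

print(red_revisions(sample_html))  # ['98543', '98549']
```

Stopping at `<br>` is a workaround for the unclosed spans in the sample; if the real page closes its spans properly, the `handle_endtag` branch alone may be enough.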

score 0 · Answer 2 · answered Jun 08 '16 at 20:33

If you know you're only looking for lines that contain the pattern color: rgb(255, 0, 0), then add that pattern to your regexp:

paragraphs = re.findall(r'color: rgb\(255, 0, 0\).*r(\d*)<br>',str(html))
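
Two details may matter with this pattern. First, `.*` is greedy and `.` does not match newlines by default, so the pattern relies on each red entry sitting on its own line. Second, if `html` is a `bytes` object (which is what urllib returns in Python 3; this is an assumption about the asker's setup), `str(html)` produces the `b'...'` representation in which newlines become a literal backslash and `n`, so the whole page is effectively one line and the greedy match would likely keep only the last revision. A minimal sketch of a more defensive variant, assuming UTF-8 content and that the revision sits in the text directly after the red span's opening tag (`red_revisions_regex` and `sample_bytes` are illustrative names):

```python
import re

# [^>]*>   skips the rest of the span's opening tag
# [^<]*?   stays inside the text that follows, up to the next tag
# r(\d+)   captures one or more digits after the "r"
RED_REVISION = re.compile(r'color: rgb\(255, 0, 0\)[^>]*>[^<]*?r(\d+)<br>')


def red_revisions_regex(raw):
    """Decode the page (UTF-8 is an assumption) and return red revisions."""
    text = raw.decode("utf-8", errors="replace")
    return RED_REVISION.findall(text)


# Sample text taken from the question, as bytes.
sample_bytes = (
    b'<p>Proxy 3.2.1 r72440<br>\n'
    b'SlotBios 11.00</p>\n'
    b'<p><strong><span style="color: rgb(255, 0, 0);">Random Text 4.23.6 r98543<br>\n'
    b'...</tr>...\n'
    b'<p><strong><span style="color: rgb(255, 0, 0);">Random Text 4.33.6 r98549<br>\n'
)

print(red_revisions_regex(sample_bytes))  # ['98543', '98549']
```

Because the pattern never lets a match run past the next tag, it gives the same result whether or not the page contains line breaks.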

John Gordon

29,573
7
33
58

How can I use regular expression/Python to find all integers after a known string, an unknown string, and another known string?

2 Answers2