I'm new to regular expressions and Python, and I'm trying to extract a revision number from an HTML page. I used a proxy and urllib to read the page into a string. I have some text that looks like:

<p>Proxy 3.2.1 r72440<br>
SlotBios 11.00</p>
<p><strong><span style="color: rgb(255, 0, 0);">Random Text 4.23.6 r98543<br>
...</tr>...
<p><strong><span style="color: rgb(255, 0, 0);">Random Text 4.33.6 r98549<br>

I want to parse the text and extract the revision numbers from the lines rendered in red. In this example, that means 98543 and 98549.

I can extract all the revision numbers with:

paragraphs = re.findall(r'r(\d*)<br>',str(html))

However, I'm a little stuck on how to match only the red lines. My current code also picks up 72440. Any idea how to get around this? Thanks!

Varun Behl
  • Do the revision numbers always have the same number of characters? Maybe you should try to cut the string from the back instead. – LordNeo Jun 08 '16 at 20:34
  • The regex engine is colourblind. It can't tell which colour your lines would be rendered in a web browser. Is there some other clue you can use to identify the numbers you are looking for? – Håken Lid Jun 08 '16 at 20:35
  • [don't use regex to parse html](http://stackoverflow.com/a/1732454/5323213) – R Nar Jun 08 '16 at 20:40
  • @RNar see the answer directly below your linked answer. – Jed Fox Jun 08 '16 at 20:46
  • The revision numbers could have 5 or 6 characters. I'm a little limited to libraries included with a Linux installation, so I haven't been able to use BeautifulSoup. I have been trying (and failing) with the second answer though, but I'll keep working along those lines. I also realized I might be able to cut out all the lines including rgb and then do a simple re findall statement. Thanks for all the responses! – Varun Behl Jun 09 '16 at 14:09
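
The line-filtering idea from the last comment can be sketched with only the standard library. This is a minimal sketch, assuming every red entry sits on a single line that contains the style text `color: rgb(255, 0, 0)` verbatim. The sample is copied from the question, and the names `RED_STYLE` and `red_revisions` are made up for illustration.

```python
import re

# Sample text from the question; in practice this would be the decoded page.
html = '''<p>Proxy 3.2.1 r72440<br>
SlotBios 11.00</p>
<p><strong><span style="color: rgb(255, 0, 0);">Random Text 4.23.6 r98543<br>
<p><strong><span style="color: rgb(255, 0, 0);">Random Text 4.33.6 r98549<br>
'''

# Assumption: this exact text appears on every line that is rendered in red.
RED_STYLE = 'color: rgb(255, 0, 0)'


def red_revisions(text):
    """Return revision numbers found on lines that carry the red style."""
    revisions = []
    for line in text.splitlines():
        if RED_STYLE in line:
            # \d+ requires at least one digit, unlike \d* in the question.
            revisions.extend(re.findall(r'r(\d+)<br>', line))
    return revisions


print(red_revisions(html))  # ['98543', '98549']
```

This would break if the page ever wraps the span and its revision number onto different lines, or if the style is written with different spacing.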

2 Answers


You need to use an HTML parser to help you filter out the tags that have the red color applied, then use your regular expression on the tag's contents:

>>> from bs4 import BeautifulSoup
>>> html = ''' (your html here) '''
>>> parser = BeautifulSoup(html, 'html.parser')
>>> for span_tag in parser.find_all('span', style='color: rgb(255, 0, 0);'):
...  print(span_tag.text)

Random Text 4.23.6 r98543

You can then collect all the text, and run your regular expression over it to extract the revision numbers:

>>> t = [i.text for i in parser.find_all('span', style='color: rgb(255, 0, 0);')] 
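
Where BeautifulSoup is not installed, the same two-step idea (select the red spans, then apply the regular expression to their text) can be approximated with the standard library's `html.parser`. This is a rough sketch and not the BeautifulSoup approach itself. It assumes the red `<span>` tags are properly closed, which the truncated sample in the question does not show, so closing tags were added here. The class name `RedSpanText` is hypothetical.

```python
import re
from html.parser import HTMLParser

# Sample based on the question, with closing tags added (an assumption).
html = '''<p>Proxy 3.2.1 r72440<br>
SlotBios 11.00</p>
<p><strong><span style="color: rgb(255, 0, 0);">Random Text 4.23.6 r98543<br>
</span></strong></p>
<p><strong><span style="color: rgb(255, 0, 0);">Random Text 4.33.6 r98549<br>
</span></strong></p>
'''


class RedSpanText(HTMLParser):
    """Collect text that appears inside <span> tags styled red."""

    RED = 'color: rgb(255, 0, 0)'

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside a red span
        self.chunks = []  # text fragments seen inside red spans

    def handle_starttag(self, tag, attrs):
        if tag != 'span':
            return
        style = dict(attrs).get('style') or ''
        # Count nested spans too, so the matching </span> is tracked.
        if self.depth or self.RED in style:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'span' and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


parser = RedSpanText()
parser.feed(html)
parser.close()

# Second step: run the regular expression over the collected text only.
revisions = re.findall(r'\br(\d+)\b', ' '.join(parser.chunks))
print(revisions)  # ['98543', '98549']
```

If the real page leaves the spans unclosed, `depth` never returns to zero and everything after the first red span would be collected, so the input is worth checking first.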
Burhan Khalid

If you know you're only looking for lines that contain the pattern color: rgb(255, 0, 0), then add that pattern to your regexp:

paragraphs = re.findall(r'color: rgb\(255, 0, 0\).*r(\d*)<br>',str(html))
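
As a quick check, this approach can be run against the sample from the question. The sketch below is a slightly tightened variant, not the exact pattern above: it uses a lazy `.*?` and `\d+` so that an empty revision is never captured. It assumes the style text and the revision number share one line, which matters because `.` does not match a newline by default.

```python
import re

# Sample text from the question.
html = '''<p>Proxy 3.2.1 r72440<br>
SlotBios 11.00</p>
<p><strong><span style="color: rgb(255, 0, 0);">Random Text 4.23.6 r98543<br>
<p><strong><span style="color: rgb(255, 0, 0);">Random Text 4.33.6 r98549<br>
'''

# Without re.DOTALL, '.' stops at the end of the line, so each match
# stays on the line that carries the red style.
pattern = re.compile(r'color: rgb\(255, 0, 0\).*?r(\d+)<br>')

paragraphs = pattern.findall(html)
print(paragraphs)  # ['98543', '98549']
```

A red line with no revision number simply produces no match, because the pattern cannot run on into the next line.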
John Gordon