0

I am trying to extract the names "Harrisburg" and "Gujranwala" from the two lines of HTML below:

<td><a href="/worldclock/city.html?n=97">Harrisburg</a><span id=p217s class=wds> *</span></td>
<td><a href="/worldclock/city.html?n=3551">Gujranwala</a><span id=p204s class=wds></span></td>

My regex currently doesn't work. How can I fix it?

My Regex:

(?<=<td><a href="\/worldclock\/city\.html\?n=[0-9]{0, 5}">).*(?=<\/a><span id=p[0-9]{0, 4}s class=wds>( \*)</span><\/td>) 

The regex is for Python. Thank you.

KingMak

4 Answers

1
import re

city_html = """<td><a href="/worldclock/city.html?n=97">Harrisburg</a><span id=p217s class=wds> *</span></td>
               <td><a href="/worldclock/city.html?n=3551">Gujranwala</a><span id=p204s class=wds></span></td>"""

cities = re.findall(r'(?:city\.html.*?>)(.*?)(?:<)', city_html)
# cities == ['Harrisburg', 'Gujranwala']

This regex looks for city.html, skips ahead to the next >, and captures everything after it up to the next <.
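To show why the lazy quantifiers (.*?) matter here, the sketch below runs a lazy and a greedy variant of the pattern against the first sample row. The greedy variant is not from the answer above; it is only an illustration, and the variable names are made up for this example.

```python
import re

# One of the two sample rows from the question.
row = '<td><a href="/worldclock/city.html?n=97">Harrisburg</a><span id=p217s class=wds> *</span></td>'

# Lazy: stop at the first ">" after city.html, then capture up to the next "<".
lazy = re.findall(r'city\.html.*?>(.*?)<', row)

# Greedy (illustrative only): ".*" runs to the end of the line and backtracks,
# so the ">" it settles on is the one closing </span>, and the group is empty.
greedy = re.findall(r'city\.html.*>(.*)<', row)

print(lazy)    # ['Harrisburg']
print(greedy)  # ['']
```

With the greedy form the city name is swallowed by the first .* and never reaches the capture group.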

mVChr
1

Depending on how much your original data varies, you don't need to match the entire line, just the part around what you want to capture. The "active ingredient" is this part, which captures all non-< characters after the opening tag: >([^<]+)<

import re
InLines = """<td><a href="/worldclock/city.html?n=97">Harrisburg</a><span id=p217s class=wds> *</span></td>\n<td><a href="/worldclock/city.html?n=3551">Gujranwala</a><span id=p204s class=wds></span></td>"""
Pattern = r'city\.html\?n=\d+">([^<]+)</a><span'
M = re.findall(Pattern, InLines)
print(M)
# ['Harrisburg', 'Gujranwala']
beroe
0

Try this regex:

([^>]*)<\s*/a\s*>
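A minimal sketch of how this pattern might be used from Python with re.findall. The sample rows are from the question; the extra "About" link is a hypothetical addition, included only to show that the pattern captures the text of any link, not just city links.

```python
import re

city_html = """<td><a href="/worldclock/city.html?n=97">Harrisburg</a><span id=p217s class=wds> *</span></td>
<td><a href="/worldclock/city.html?n=3551">Gujranwala</a><span id=p204s class=wds></span></td>"""

# Capture the run of non-">" characters immediately before a closing </a> tag.
pattern = r'([^>]*)<\s*/a\s*>'

names = re.findall(pattern, city_html)
print(names)  # ['Harrisburg', 'Gujranwala']

# Hypothetical unrelated link: the same pattern picks up its text as well.
other = re.findall(pattern, '<p><a href="/about.html">About</a></p>')
print(other)  # ['About']
```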
Emil Davtyan
  • I didn't downvote, but maybe they didn't like that this would match any href elsewhere in the data set? Hard to know how specific the OP's application would be, though, since they were pretty broad in their statement! – beroe Sep 30 '13 at 22:29
  • @beroe I guess, but writing regex is a hackish solution and making the regex too precise makes it easier to break. A simple regex can actually be more robust in these circumstances. – Emil Davtyan Sep 30 '13 at 22:37
0

You can't use a lookbehind unless its subexpression has a fixed length. This is because the regex engine needs to know how far back to step before it starts checking the lookbehind. In this case, the [0-9]{0, 5} part is meant to match a variable number of digits, so the lookbehind could match strings of different lengths. This is how it works in Perl, and Python's re module has the same fixed-width restriction.
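A small sketch of that restriction in Python, together with one way around it: move the variable-length prefix out of the lookbehind and use a capturing group instead. The patterns here are simplified versions of the one in the question, with the quantifier written as {0,5} (no space) so that Python parses it as a quantifier; the variable names are made up for this example.

```python
import re

row = '<td><a href="/worldclock/city.html?n=97">Harrisburg</a><span id=p217s class=wds> *</span></td>'

# A lookbehind containing [0-9]{0,5} has no fixed width, so re refuses to compile it.
try:
    re.compile(r'(?<=city\.html\?n=[0-9]{0,5}">)[^<]+')
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# Alternative: match the prefix normally and capture only the name.
names = re.findall(r'city\.html\?n=[0-9]{0,5}">([^<]+)</a>', row)

print(lookbehind_ok)  # False
print(names)          # ['Harrisburg']
```

Because re.findall returns only the group when the pattern has exactly one capturing group, the prefix never shows up in the result.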

David Knipe