Fix regex to extract city names from HTML

Question

I am trying to extract the names "Harrisburg" and "Gujranwala" from the two lines of HTML below:

<td><a href="/worldclock/city.html?n=97">Harrisburg</a><span id=p217s class=wds> *</span></td>
<td><a href="/worldclock/city.html?n=3551">Gujranwala</a><span id=p204s class=wds></span></td>

My current regex doesn't work. How can I fix it?

My Regex:

(?<=<td><a href="\/worldclock\/city\.html\?n=[0-9]{0, 5}">).*(?=<\/a><span id=p[0-9]{0, 4}s class=wds>( \*)</span><\/td>)

The regex is for Python. Thank you.

possible duplicate of [Extracting text from HTML file using Python](http://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python) — Dave Jarvis, Sep 30 '13 at 22:30

score 1 · Accepted Answer · answered Sep 30 '13 at 22:18

import re

city_html = """<td><a href="/worldclock/city.html?n=97">Harrisburg</a><span id=p217s class=wds> *</span></td>
               <td><a href="/worldclock/city.html?n=3551">Gujranwala</a><span id=p204s class=wds></span></td>"""

cities = re.findall(r'(?:city\.html.*?>)(.*?)(?:<)', city_html)
# cities == ['Harrisburg', 'Gujranwala']

What this RegEx is doing is looking for city.html ... > and grabbing everything after it until the next <.
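
To make the three pieces of that pattern easier to see, here is the same regex written with re.VERBOSE so each part can carry a comment. This is only an annotated sketch of the pattern above, not a different approach; the variable name `pattern` is mine.

```python
import re

city_html = (
    '<td><a href="/worldclock/city.html?n=97">Harrisburg</a><span id=p217s class=wds> *</span></td>\n'
    '<td><a href="/worldclock/city.html?n=3551">Gujranwala</a><span id=p204s class=wds></span></td>'
)

# Same pattern as in the answer, split up with re.VERBOSE
# (whitespace and '#' comments inside the pattern are ignored).
pattern = re.compile(r"""
    (?:city\.html.*?>)   # find 'city.html', then lazily skip to the '>' that closes the tag
    (.*?)                # capture as few characters as possible ...
    (?:<)                # ... up to the next '<', which starts the closing </a>
""", re.VERBOSE)

cities = pattern.findall(city_html)
print(cities)
```

Because the pattern has exactly one capturing group, findall returns just the captured city names rather than the whole match.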

This made the most sense to me, thank you, sir. – KingMak Sep 30 '13 at 22:32

score 1 · Answer 2 · answered Sep 30 '13 at 22:20

Depending on how much your original data varies, you don't need to match the entire line, only the part around the text you want to capture. The "active ingredient" is the piece that captures all non-< characters after the opening tag: >([^<]+)<

import re
InLines = """<td><a href="/worldclock/city.html?n=97">Harrisburg</a><span id=p217s class=wds> *</span></td>\n<td><a href="/worldclock/city.html?n=3551">Gujranwala</a><span id=p204s class=wds></span></td>"""
Pattern = r'city\.html\?n=\d+">([^<]+)</a><span'
M = re.findall(Pattern, InLines)
print(M)
# ['Harrisburg', 'Gujranwala']

score 0 · Answer 3 · answered Sep 30 '13 at 22:17

Try this regex :

([^>]*)<\s*/a\s*>

Emil Davtyan


I didn't downvote, but maybe they didn't like that this would match any href elsewhere in the data set? Hard to know how specific the OP's application would be, though, since they were pretty broad in their statement! – beroe Sep 30 '13 at 22:29
@beroe I guess, but writing regex is a hackish solution and making the regex too precise makes it easier to break. A simple regex can actually be more robust in these circumstances. – Emil Davtyan Sep 30 '13 at 22:37
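
For anyone weighing the trade-off discussed in these comments, here is a small sketch of the short pattern in Python. The extra "About" link is a made-up line I added to the sample, purely to show that the pattern captures the text of every link and not only the city links.

```python
import re

html = (
    '<td><a href="/worldclock/city.html?n=97">Harrisburg</a><span id=p217s class=wds> *</span></td>\n'
    '<td><a href="/worldclock/city.html?n=3551">Gujranwala</a><span id=p204s class=wds></span></td>\n'
    '<p><a href="/about.html">About</a></p>'  # hypothetical extra link, not in the question
)

# Capture whatever sits between the previous '>' and a closing </a> tag.
link_text = re.findall(r'([^>]*)<\s*/a\s*>', html)
print(link_text)
```

So the short pattern tolerates markup changes around the link, at the cost of also picking up unrelated links if the page has any.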

score 0 · Answer 4 · answered Sep 30 '13 at 22:23

You can't use lookbehinds unless the lookbehind subexpression has fixed length. This is because the regex engine needs to know where to start looking for a match. In this case, the [0-9]{0, 5} part means the regex can match strings of different lengths. (At least this is how it works in Perl.)
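
Since the question is about Python: as far as I know, Python's re module has the same fixed-width restriction on lookbehinds, and it also treats `{0, 5}` (with a space) as literal text rather than a quantifier. The sketch below checks both and then shows one possible fix that swaps the lookarounds for a capture group and makes the " *" optional. The names `lookbehind_compiles`, `space_is_literal` and `fixed` are mine, and the rewritten pattern is just one reading of what the question intended.

```python
import re

html = (
    '<td><a href="/worldclock/city.html?n=97">Harrisburg</a><span id=p217s class=wds> *</span></td>\n'
    '<td><a href="/worldclock/city.html?n=3551">Gujranwala</a><span id=p204s class=wds></span></td>'
)

# 1. A lookbehind whose width can vary ({0,5}) is rejected at compile time.
try:
    re.compile(r'(?<=city\.html\?n=[0-9]{0,5}">)[^<]+')
    lookbehind_compiles = True
except re.error:
    lookbehind_compiles = False

# 2. With a space inside the braces, "{0, 5}" is matched as literal characters.
space_is_literal = re.fullmatch(r'[0-9]{0, 5}', '7{0, 5}') is not None

# 3. One possible fix: no lookarounds, a capture group for the name,
#    no spaces in the quantifiers, and an optional " *" before </span>.
fixed = re.compile(
    r'city\.html\?n=[0-9]{0,5}">(.*?)</a>'
    r'<span id=p[0-9]{0,4}s class=wds>(?: \*)?</span></td>'
)
names = fixed.findall(html)
print(lookbehind_compiles, space_is_literal, names)
```

Note that in the original pattern the group `( \*)` is mandatory, so even with the other problems fixed it would only match rows that contain the " *", such as the Harrisburg row.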
