python regex: extract contents of an HTML element

Question

I have elements in an HTML page in this format:

<td class="cell1"><b>Dave Mason's Traffic Jam</b></td><td class="cell2">Scottish Rite
Auditorium</td><td class="cell3">$29-$45</td><td class="cell4">On sale now</td><td class="cell5"><a 
href="http://www.ticketmaster.com/dave-masons-traffic-jam-collingswood-new-jersey-11-29-2014/event  
/02004B48C416D202?artistid=1033927&majorcatid=10001&minorcatid=1&tm_link=venue_msg-
1_02004B48C416D202" target="_blank">TIX</a></td><td class="cell6">AA</td><td 
class="cell7">Philadelphia</td>

I want to use Python to extract the "Dave Mason's Traffic Jam" part, the "Scottish Rite Auditorium" part, etc. individually from the text. Using the regular expression '.*' returns everything from the first tag to the last tag before the next newline. How can I change the expression so that it returns only the chunk between each tag pair?

Edit: @HenryKeiter & @Hakiko that'd be grand, but this is for an assignment that requires me to use Python regex.

Use a real HTML parser like [BeautifulSoup](http://beautiful-soup-4.readthedocs.org/en/latest/). Don't bother trying to parse HTML with regex. [That way lies madness.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) — Henry Keiter, May 10 '14 at 22:36
I think extracting your content with regex is a headache. Use an HTML parser. — hakki, May 10 '14 at 22:38
Re: your edit: if you are *required* to use regex, you need to better define what exactly you need to extract. All the content of all the cells? Just certain ones? Figure out how to describe what it is that you need to match, and the rest is just learning to use [regex syntax.](https://docs.python.org/2/library/re.html#regular-expression-syntax) Hint: `'.*'` will probably go in the middle of what you want, once you figure out what the boundaries should be. — Henry Keiter, May 10 '14 at 22:43
Also, as the only possible purpose of such an assignment is to teach you the Python regex syntax, I don't think it's wise to come here begging answers before you've really started. Once you can better define your problem, if you have a *specific* issue making a certain regex work, ask a question related to that and you'll be much more likely to get helpful, relevant answers. — Henry Keiter, May 10 '14 at 22:46

Oleg Gryb · Accepted Answer · 2014-05-10T22:46:36.990

1

Here is a hint, not a full solution: you'll need to use a non-greedy regexp in your case. Basically, you'll need to use

.*?

instead of

.*

Non-greedy means the quantifier matches as few characters as possible. By default, quantifiers are greedy and match as many characters as possible.
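To make the difference concrete, here is a minimal sketch using a shortened copy of the row from the question. The surrounding pattern `<td[^>]*>(...)</td>` and the `re.DOTALL` flag are assumptions added for illustration, not part of the hint above; `re.DOTALL` is included because one cell contains a newline, which `.` does not match by default.

```python
import re

# Shortened version of the row from the question (cell2 contains a newline).
html = ('<td class="cell1"><b>Dave Mason\'s Traffic Jam</b></td>'
        '<td class="cell2">Scottish Rite\nAuditorium</td>'
        '<td class="cell3">$29-$45</td>')

# Greedy: .* runs on to the LAST </td> it can reach, so all cells
# are swallowed into a single match.
greedy = re.findall(r'<td[^>]*>(.*)</td>', html, re.DOTALL)

# Non-greedy: .*? stops at the FIRST </td> after each opening tag,
# so every cell becomes its own match.
lazy = re.findall(r'<td[^>]*>(.*?)</td>', html, re.DOTALL)

print(len(greedy))  # 1
print(lazy)
# ["<b>Dave Mason's Traffic Jam</b>", 'Scottish Rite\nAuditorium', '$29-$45']
```

Note that the first cell still carries its inner `<b>` tags; removing those is a separate step.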

edited May 10 '14 at 22:46

answered May 10 '14 at 22:38


score 1 · Answer 2 · answered May 10 '14 at 22:53

Use Beautiful Soup:

from bs4 import BeautifulSoup

html = '''
<td class="cell1"><b>Dave Mason's Traffic Jam</b></td><td class="cell2">Scottish Rite
Auditorium</td><td class="cell3">$29-$45</td><td class="cell4">On sale now</td><td class="cell5"><a 
href="http://www.ticketmaster.com/dave-masons-traffic-jam-collingswood-new-jersey-11-29-2014/event  
/02004B48C416D202?artistid=1033927&majorcatid=10001&minorcatid=1&tm_link=venue_msg-
1_02004B48C416D202" target="_blank">TIX</a></td><td class="cell6">AA</td><td 
class="cell7">Philadelphia</td>
'''.strip()

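# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html, 'html.parser')
# Collect the text of every <td>, ignoring nested tags such as <b> and <a>.
contentList = [td.get_text() for td in soup.find_all('td')]
# Parenthesized print works under both Python 2 and Python 3.
print(contentList)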

Returns (Python 2 output; Python 3 prints the same strings without the u prefix):

[u"Dave Mason's Traffic Jam", u'Scottish Rite\nAuditorium', u'$29-$45', u'On sale now', u'TIX', u'AA', u'Philadelphia']
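If a parser is ruled out, as in the assignment, a roughly equivalent list can be sketched with the `re` module alone. This is only an illustration for flat markup like the row in the question, not a general HTML solution: the helper name `cell_texts` is hypothetical, the link target is a placeholder URL, and the row is shortened. Unlike the Beautiful Soup output above, it also collapses the embedded newline into a single space.

```python
import re

# Shortened row in the same shape as the question's markup;
# the href is a placeholder, not the original Ticketmaster link.
html = '''<td class="cell1"><b>Dave Mason's Traffic Jam</b></td><td class="cell2">Scottish Rite
Auditorium</td><td class="cell5"><a
href="http://example.com/event" target="_blank">TIX</a></td><td
class="cell7">Philadelphia</td>'''


def cell_texts(markup):
    """Return the visible text of each <td> cell (hypothetical helper)."""
    # Non-greedy capture of everything between each <td ...> and its </td>.
    cells = re.findall(r'<td\b[^>]*>(.*?)</td>', markup, re.DOTALL)
    texts = []
    for cell in cells:
        # Drop any nested tags such as <b> or <a ...>.
        no_tags = re.sub(r'<[^>]+>', '', cell)
        # Collapse runs of whitespace, including newlines, into one space.
        texts.append(' '.join(no_tags.split()))
    return texts


print(cell_texts(html))
# ["Dave Mason's Traffic Jam", 'Scottish Rite Auditorium', 'TIX', 'Philadelphia']
```

This approach breaks on nested tables, on attribute values containing `>`, and on HTML entities, which is why the comments recommend a real parser outside of an exercise.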
