Complicated regex to extract author name in python

Question

I'm trying, quite unsuccessfully, to write a regex. What I want is to get the content of any HTML element that has a class of (author|byline|writer).

Here is what I have so far

<([A-Z][A-Z0-9]*)class=\"(byLineTag|byline|author|by)\"[^>]*>(.*?)</\1>

Examples of what I need to match:

  <h6 class="byline">By <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/e/jack_ewing/index.html?inline=nyt-per" title="More Articles by Jack Ewing" class="meta-per">JACK EWING</a> and <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/t/landon_jr_thomas/index.html?inline=nyt-per" title="More Articles by Landon Thomas Jr." class="meta-per">LANDON THOMAS Jr.</a></h6>

or

<div class="noindex"><span class="by">By </span><span class="byline"><a href="javascript:NewWindow(575,480,'/apps/pbcs.dll/personalia?ID=sshemkus',0)" title="Email Reporter">Sarah Shemkus</a></span></div>

Any help would be appreciated a lot. -Stefan

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454 — MitMaro, Jul 04 '11 at 21:57
Don't do it. See MitMaro's link. Imagine something like `
hello world
another block
`. It cannot be done. HTML is not a regular language. Use an appropriate parser. — Kerrek SB, Jul 04 '11 at 22:02
Remember that Perl-style (Python) regexps are case-sensitive, so [A-Z] and [a-z] are not the same thing. To match all letters, you must write [A-Za-z]. That said, you should really be looking at Python's HTML parser instead; it will save you a lot of trouble, and is also quite good at understanding broken HTML (and there is *a lot* of broken HTML out there). — jforberg, Jul 04 '11 at 22:06
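
A minimal sketch of the nesting problem raised in the comments above. The markup and the simplified pattern are hypothetical stand-ins (they are not taken from the question), chosen only to show what a non-greedy match does when an element contains children with the same tag name.

```python
import re

# Hypothetical markup: an "author" div that contains two nested divs.
nested = '<div class="author"><div>hello world</div><div>another block</div></div>'

# Simplified pattern in the spirit of the question's regex, with
# lower-case letters allowed so that it can match this markup at all.
pattern = re.compile(r'<([A-Za-z][A-Za-z0-9]*) class="author"[^>]*>(.*?)</\1>')

match = pattern.search(nested)
content = match.group(2)

# The lazy (.*?) stops at the FIRST </div>, which closes the inner div,
# so only part of the outer element's content is captured.
print(content)
```

The captured text ends inside the first nested div, because a regex cannot count how many divs were opened before it meets a closing tag.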

score 2 · Answer 1 · answered Jul 04 '11 at 22:29

Regex is not particularly well-suited to parsing HTML.
Thankfully there are tools specifically created for parsing HTML, e.g. BeautifulSoup and lxml; the latter of which is demonstrated below:

markup = '''<h6 class="byline">By <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/e/jack_ewing/index.html?inline=nyt-per" title="More Articles by Jack Ewing" class="meta-per">JACK EWING</a> and <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/t/landon_jr_thomas/index.html?inline=nyt-per" title="More Articles by Landon Thomas Jr." class="meta-per">LANDON THOMAS Jr.</a></h6><div class="noindex"><span class="by">By </span><span class="byline"><a href="javascript:NewWindow(575,480,'/apps/pbcs.dll/personalia?ID=sshemkus',0)" title="Email Reporter">Sarah Shemkus</a></span></div>'''

import lxml.html

doc = lxml.html.fromstring(markup)
# .cssselect() relies on the separate cssselect package being installed
for a in doc.cssselect('.author, .by, .byline, .byLineTag'):
    print(a.text_content())
# By JACK EWING and LANDON THOMAS Jr.
# By 
# Sarah Shemkus

+1 for the alternative way of using a CSS selector. I must have missed .cssselect() — Rob Cowie, Jul 05 '11 at 00:12

score 2 · Answer 2 · answered Jul 04 '11 at 22:30

I strongly suggest not using a regexp to parse the HTML, for the reasons already mentioned. Use an existing HTML parser. As an example of how easy it can be, here is one using lxml and its CSS selector.

from lxml import etree
from lxml.cssselect import CSSSelector

## Your html string
html_string = '''<h6 class="byline">By <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/e/jack_ewing/index.html?inline=nyt-per" title="More Articles by Jack Ewing" class="meta-per">JACK EWING</a> and <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/t/landon_jr_thomas/index.html?inline=nyt-per" title="More Articles by Landon Thomas Jr." class="meta-per">LANDON THOMAS Jr.</a></h6>'''

## lxml html parser
html = etree.HTML(html_string)

## lxml CSS selector
sel = CSSSelector('.author, .byline, .writer')

## Call the selector to get matches
matching_elements = sel(html)

for elem in matching_elements:
    # elem.text stops at the first child tag, so join all text nodes instead
    print(''.join(elem.itertext()))

score 0 · Answer 3 · Stephan · 2011-07-04T22:21:23.480

Try this:

<([A-Z][A-Z0-9]*).*?class=\"(byLineTag|byline|author|by)\"[^>]*?>(.*?)</\1>

What I have added:
- .*? after the tag name, in case the class attribute doesn't appear right after the opening tag name.
- [^>]*? instead of [^>]*, which makes the * operator non-greedy when looking for the closing >.
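
A rough check of this expression, as a sketch under two assumptions that are not in the answer: the pattern is compiled with re.IGNORECASE (otherwise [A-Z] cannot match lower-case tag names), and the two examples from the question are shortened, with href="#" as a placeholder.

```python
import re

# The expression from this answer; IGNORECASE is an assumption added here.
pattern = re.compile(
    r'<([A-Z][A-Z0-9]*).*?class="(byLineTag|byline|author|by)"[^>]*?>(.*?)</\1>',
    re.IGNORECASE,
)

# Shortened versions of the two examples from the question.
first = '<h6 class="byline">By <a rel="author" href="#">JACK EWING</a></h6>'
second = ('<div class="noindex"><span class="by">By </span>'
          '<span class="byline"><a href="#">Sarah Shemkus</a></span></div>')

m1 = pattern.search(first)
print(m1.group(1), '|', m1.group(3))

m2 = pattern.search(second)
# The leading .*? is free to run past the end of the <div ...> tag, so the
# outer div gets paired with the class attribute of an inner span.
print(m2.group(1), '|', m2.group(2), '|', m2.group(3))
```

On the first example the captured content is the inside of the h6. On the second, the search still succeeds, but it combines the div tag name with the "by" class of a nested span and captures everything up to the closing div, which is not the content of any single matching element.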

edited Jul 04 '11 at 22:21

answered Jul 04 '11 at 22:04


Thanks for the prompt response. This works on my first example but not on the second. – Stefan Harris Jul 04 '11 at 22:10
I have added a small enhancement to the regexp: a **?** at the end of the regexp. Can you try it? – Stephan Jul 04 '11 at 22:16

score 0 · Answer 4 · answered Jul 04 '11 at 22:29

You forgot to account for the space between the tag name and the first attribute name. Also, unless you're sure that class will always be the first attribute, your expression should allow other attributes to come before it. The \1 back-reference is correct as it stands, if you really care about the closing tag: in Python, group numbering for back-references starts at 1, so \1 refers to the captured tag name. As I noted in my comment, you should also include lower-case characters in your character classes.
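
A small sketch of how Python's re module numbers groups, using a simplified pattern and input that are made up for illustration: group 0 is the whole match, and \1 inside the pattern refers to the first parenthesised group.

```python
import re

# Group 1 captures the tag name; \1 must then match the same text again.
pattern = re.compile(r'<([a-z0-9]+)>(.*?)</\1>')

m = pattern.search('<h6>By JACK EWING</h6>')
whole, tag, content = m.group(0), m.group(1), m.group(2)
print(whole)    # the entire match, including both tags
print(tag)      # the text captured by group 1
print(content)  # the text captured by group 2

# A closing tag with a different name does not satisfy \1.
mismatch = pattern.search('<h6>By JACK EWING</h5>')
print(mismatch)
```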

Here is a better expression (I've disregarded the closing tag to make it simpler):

<[A-Za-z][A-Za-z0-9]*.*? class=["'](byLineTag|byline|author|by)["'][^>]*>

Remember to join all lines together first, to avoid errors when tags are split across several lines. Of course, you would probably save yourself a lot of work if you used Python's HTML parser instead.
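
A minimal sketch of the parser route using only the standard library's html.parser module. The class name BylineExtractor, the TARGET_CLASSES set (taken from the class names listed in the question) and the shortened sample markup are assumptions made for illustration; this is one possible approach, not a definitive implementation.

```python
from html.parser import HTMLParser

TARGET_CLASSES = {"author", "byline", "writer"}
# Void elements never get a closing tag, so they must not change the depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class BylineExtractor(HTMLParser):
    """Collects the text inside elements whose class is in TARGET_CLASSES."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside a matching element
        self.parts = []     # text pieces of the element being collected
        self.results = []   # finished text of each matching element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or TARGET_CLASSES.intersection(classes):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.results.append("".join(self.parts).strip())
            self.parts = []

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


# Shortened version of the second example from the question.
sample = ('<div class="noindex"><span class="by">By </span>'
          '<span class="byline"><a href="#">Sarah Shemkus</a></span></div>')

parser = BylineExtractor()
parser.feed(sample)
parser.close()
results = parser.results
print(results)
```

Because the parser tracks nesting depth instead of searching for the first closing tag, nested children inside the matching element are handled without any special casing.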

Thanks, but this doesn't capture the content of the tag. – Stefan Harris Jul 04 '11 at 22:40
