I'm trying, so far unsuccessfully, to write a regex that captures the content of any HTML element whose class is one of (author|byline|writer).

Here is what I have so far:

<([A-Z][A-Z0-9]*)class=\"(byLineTag|byline|author|by)\"[^>]*>(.*?)</\1>

Examples of what I need to match:

  <h6 class="byline">By <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/e/jack_ewing/index.html?inline=nyt-per" title="More Articles by Jack Ewing" class="meta-per">JACK EWING</a> and <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/t/landon_jr_thomas/index.html?inline=nyt-per" title="More Articles by Landon Thomas Jr." class="meta-per">LANDON THOMAS Jr.</a></h6>

or

<div class="noindex"><span class="by">By </span><span class="byline"><a href="javascript:NewWindow(575,480,'/apps/pbcs.dll/personalia?ID=sshemkus',0)" title="Email Reporter">Sarah Shemkus</a></span></div>

Any help would be much appreciated. -Stefan

  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454 – MitMaro Jul 04 '11 at 21:57
  • Don't do it. See MitMaro's link. Imagine something like `
    hello world
    another block
    `. It cannot be done. HTML is not a regular language. Use an appropriate parser.
    – Kerrek SB Jul 04 '11 at 22:02
  • Can you post some sample input and the expected output? – Stephan Jul 04 '11 at 22:02
  • Remember that Perl-style (Python) regexps are case-sensitive by default, so [A-Z] and [a-z] are not the same thing. To match all letters, you must write [A-Za-z]. That said, you should really be looking at Python's HTML parser instead; it will save you a lot of trouble, and is also quite good at understanding broken HTML (and there is *a lot* of broken HTML out there). – jforberg Jul 04 '11 at 22:06

4 Answers

Regex is not particularly well-suited to parsing HTML.
Thankfully there are tools specifically created for parsing HTML, e.g. BeautifulSoup and lxml; the latter of which is demonstrated below:

markup = '''<h6 class="byline">By <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/e/jack_ewing/index.html?inline=nyt-per" title="More Articles by Jack Ewing" class="meta-per">JACK EWING</a> and <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/t/landon_jr_thomas/index.html?inline=nyt-per" title="More Articles by Landon Thomas Jr." class="meta-per">LANDON THOMAS Jr.</a></h6><div class="noindex"><span class="by">By </span><span class="byline"><a href="javascript:NewWindow(575,480,'/apps/pbcs.dll/personalia?ID=sshemkus',0)" title="Email Reporter">Sarah Shemkus</a></span></div>'''

import lxml.html  # .cssselect() also needs the separate `cssselect` package

doc = lxml.html.fromstring(markup)
for a in doc.cssselect('.author, .by, .byline, .byLineTag'):
    print(a.text_content())
# By JACK EWING and LANDON THOMAS Jr.
# By 
# Sarah Shemkus
mechanical_meat

I strongly suggest not using a regexp to parse the HTML, for the reasons already mentioned. Use an existing HTML parser. As an example of how easy it can be, here is a snippet that uses lxml and its CSS selector.

from lxml import etree
from lxml.cssselect import CSSSelector

## Your html string
html_string = '''<h6 class="byline">By <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/e/jack_ewing/index.html?inline=nyt-per" title="More Articles by Jack Ewing" class="meta-per">JACK EWING</a> and <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/t/landon_jr_thomas/index.html?inline=nyt-per" title="More Articles by Landon Thomas Jr." class="meta-per">LANDON THOMAS Jr.</a></h6>'''

## lxml html parser
html = etree.HTML(html_string)

## lxml CSS selector
sel = CSSSelector('.author, .byline, .writer')

## Call the selector to get matches
matching_elements = sel(html)

for elem in matching_elements:
    # elem.text alone stops at the first child tag, so join all text nodes
    print(''.join(elem.itertext()))
Rob Cowie

Try this:

<([A-Z][A-Z0-9]*).*?class=\"(byLineTag|byline|author|by)\"[^>]*?>(.*?)</\1>

What I have added:
- .*? after the tag name, in case the class attribute doesn't appear right after it.
- *? instead of * before the closing >, so that this quantifier is non-greedy.
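
To see how this expression behaves, here is a minimal sketch that runs it with Python's re module. The re.IGNORECASE flag is an addition made here (without it, [A-Z] does not match the lower-case tag names), and the names PATTERN, simple and nested, as well as the shortened href="#" values, are illustrative assumptions rather than part of the original samples.

```python
import re

# Pattern from the answer above, unchanged.
PATTERN = r'<([A-Z][A-Z0-9]*).*?class=\"(byLineTag|byline|author|by)\"[^>]*?>(.*?)</\1>'

# Assumption: IGNORECASE is added so [A-Z] also matches lower-case tag names.
regex = re.compile(PATTERN, re.IGNORECASE)

# Shortened versions of the two samples from the question.
simple = '<h6 class="byline">By <a rel="author" href="#">JACK EWING</a></h6>'
nested = ('<div class="noindex"><span class="by">By </span>'
          '<span class="byline"><a href="#">Sarah Shemkus</a></span></div>')

# Without the flag nothing matches, because every tag name is lower-case.
print(re.search(PATTERN, simple))  # None

m = regex.search(simple)
print(m.group(1), m.group(2))  # h6 byline
print(m.group(3))              # By <a rel="author" href="#">JACK EWING</a>

# Pitfall: .*? may run past the end of the <div> tag, so the match starts
# at the outer <div> and borrows the class of the inner <span>.
m = regex.search(nested)
print(m.group(1), m.group(2))  # div by
```

The second result shows why the parser-based answers are safer: the expression reports the outer div as the matching element even though its own class is noindex.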

Stephan

You forgot to account for the space between the tag name and the first attribute name. Also, unless you're sure that class will always be the first attribute, you should allow for other attributes preceding it. The \1 back-reference is fine if you really care about the closing tag: back-references are numbered from 1, and group 0 is the whole match. As I've noted in my comment, you should also include lower-case characters in your character classes.

Here is a better expression (I've disregarded the closing tag to make it simpler):

<[A-Za-z][A-Za-z0-9]*.*? class=["'](byLineTag|byline|author|by)["'][^>]*>
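
As a rough illustration, the expression above can be tried with Python's re module as follows. The names OPEN_TAG, one_line, split_tag and joined are hypothetical, and the sample strings are shortened stand-ins for the markup in the question.

```python
import re

# Pattern from the answer above, unchanged; it only matches the opening tag.
OPEN_TAG = re.compile(
    r'''<[A-Za-z][A-Za-z0-9]*.*? class=["'](byLineTag|byline|author|by)["'][^>]*>'''
)

one_line = '<h6 class="byline">By JACK EWING</h6>'
m = OPEN_TAG.search(one_line)
print(m.group(0))  # <h6 class="byline">
print(m.group(1))  # byline

# A tag split across two lines is missed, because "." does not match "\n"
# and the pattern expects a literal space before "class".
split_tag = '<span\n class="by">By </span>'
print(OPEN_TAG.search(split_tag))  # None

# Joining the lines with a space makes the same tag match.
joined = " ".join(split_tag.splitlines())
print(OPEN_TAG.search(joined).group(1))  # by
```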

Remember to join all lines together first, to avoid errors when tags are split across several lines. Of course, you would probably save yourself a lot of work if you used Python's HTML parser instead.
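
For comparison, here is a minimal sketch of the parser route using only the standard library's html.parser module (Python 3). The class name BylineExtractor, the TARGET_CLASSES set and the shortened sample markup are assumptions made for illustration; a real page may need more care with malformed nesting.

```python
from html.parser import HTMLParser

# Assumption: the class names of interest, taken from the question.
TARGET_CLASSES = {"author", "byline", "writer", "by", "byLineTag"}
# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class BylineExtractor(HTMLParser):
    """Collect the text content of every element with a target class."""

    def __init__(self):
        super().__init__()
        self.results = []  # text of each matching element, in closing order
        self._open = []    # [depth, text parts] per matching element still open

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        for entry in self._open:  # every open match is now one level deeper
            entry[0] += 1
        classes = (dict(attrs).get("class") or "").split()
        if TARGET_CLASSES.intersection(classes):
            self._open.append([1, []])

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        for entry in self._open:
            entry[0] -= 1
        # A depth of zero means the matching element itself was just closed.
        while self._open and self._open[-1][0] <= 0:
            self.results.append("".join(self._open.pop()[1]))

    def handle_data(self, data):
        for entry in self._open:
            entry[1].append(data)


html_doc = ('<h6 class="byline">By <a rel="author" href="#">JACK EWING</a> and '
            '<a rel="author" href="#">LANDON THOMAS Jr.</a></h6>'
            '<div class="noindex"><span class="by">By </span>'
            '<span class="byline"><a href="#">Sarah Shemkus</a></span></div>')

parser = BylineExtractor()
parser.feed(html_doc)
parser.close()
print(parser.results)
# ['By JACK EWING and LANDON THOMAS Jr.', 'By ', 'Sarah Shemkus']
```

Because the class attribute is split on whitespace, an element such as class="byline top" would also be picked up, which the regular expressions above do not handle.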

jforberg