Regular expression to match only direct subtag?

Question

I'm writing a simple RSS parser (I know there are many already written) and I stumbled across a problem. Let's say I have the following RSS feed:

<channel>
  <title>Sunset Boulevard</title>
  <link>http://www.imdb.com/title/tt0043014/</link>
  <description>A hack screenwriter writes a screenplay..</description>
  <language>English</language>
  <item>
    <rating>8.6</rating>
  </item>
</channel>

I have a method that, given a tag and its subtags, extracts them into a simple hash. Here's my "method":

def extract_text_from_tag(text, tag)
  text =~ /<#{tag}.*?>(?<tag_text>.*?)<\/#{tag}>/m ? $~[:tag_text] : ''
end

To parse the channel, I first extract its text, and then, using an array of predefined tags (title, link, etc.), I extract their data. However, I want my regular expression to match only direct children of my tag.

For example, if I pass the 'title', 'link', 'description', 'language' and 'rating' tags here, I want to match all of them except 'rating' (because it's a child of item).

This is why parsing XML with regular expressions is tricky. Possible (for well-defined cases), but tricky. — Michael Myers, Feb 14 '13 at 15:07
Is it a requirement to do it with regex-es? solving this with xpath or via dom parsing seems easier... — Laur Ivan, Feb 14 '13 at 15:09
MichaelMyers - I know it's tricky, but the format is well-defined. equinoxel - Yes, it's a requirement. — , Feb 14 '13 at 15:18
Yes, but I'm interested in the regular expression, I can "translate" it into a ruby one if I have to. :) — , Feb 14 '13 at 15:29
I thought maybe specifying the language would help the code prettifier not mangle the highlighting for the function, but apparently not. The prettifier is fairly brittle because, get this, it uses regular expressions to parse non-regular languages. — Michael Myers, Feb 14 '13 at 15:40
A quick-and-dirty approach is to return `''` if the captured text looks like it contains a tag -- for example, using a second regex like this: `/<\w+>/`. — FMc, Feb 14 '13 at 16:06
I'd strongly recommend reading "[RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)". It covers the issues of trying to use regex to parse HTML or XML. While it might seem "fun" to write a RSS parser, you really should consider reusing a wheel, rather than invent your own. RSS in the wild is a mess, with several specs, not including ATOM, which is also used for feeds. I wrote one that was parsing all the variations, handling hundreds of feeds, and it was an "interesting" challenge. — the Tin Man, Feb 14 '13 at 18:07

maerics · Answer 1 · 2013-02-14T18:09:34.653

I see from the comments that you must parse this RSS feed with regular expressions instead of a proper XML parser.

However, as a counterexample, here's what a solution would look like using Nokogiri:

require 'nokogiri'

doc = Nokogiri::XML(rss_xml_string)
doc.xpath('/channel/*').each do |node| # For each direct child of the root "channel".
  next unless node.element_children.empty? # Skip nodes that contain child elements (e.g. "item").
  puts "#{node.name}: #{node.text}"
end
# title: Sunset Boulevard
# link: http://www.imdb.com/title/tt0043014/
# description: A hack screenwriter writes a screenplay..
# language: English

score 0 · Answer 2 · answered Feb 14 '13 at 16:22

With the caveat that things can get complicated beyond what you can do with regex, here are some suggestions:

Instead of .*? you can use [^<>]*?, assuming that "<" and ">" are escaped properly in the XML.

This prevents extracting the content of item when it contains a child element, which may or may not be the desired behavior (I take it that this is OK in your example, but it might not be OK in general).
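
As a rough sketch of that suggestion, here is a variant of the question's method with [^<>]*? in place of .*?. The name extract_leaf_text and the shortened sample feed are made up for illustration. Note that it only refuses to capture content that itself contains tags; it still finds 'rating' when searching the whole channel text, so on its own it does not restrict matches to direct children.

```ruby
# Hypothetical variant of the question's extract_text_from_tag.
# [^<>]*? cannot cross a "<" or ">", so an element whose content
# contains child tags never matches.
def extract_leaf_text(text, tag)
  name = Regexp.escape(tag)
  text =~ /<#{name}\b[^<>]*>(?<tag_text>[^<>]*?)<\/#{name}>/ ? $~[:tag_text] : ''
end

# Shortened sample feed (assumption: "<" and ">" in content are escaped).
channel = <<~XML
  <title>Sunset Boulevard</title>
  <item>
    <rating>8.6</rating>
  </item>
XML

puts extract_leaf_text(channel, 'title').inspect  # => "Sunset Boulevard"
puts extract_leaf_text(channel, 'item').inspect   # => "" (item contains a child tag)
puts extract_leaf_text(channel, 'rating').inspect # => "8.6" (nested, but still matched)
```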

If you still need to extract the content of "item" (if any) except for the possible child items, you need regex conditionals which, if I am not mistaken, Ruby supports only from version 2.0 onward.

You can replace it with a method that tests whether the tag contains a child element and applies the regex accordingly, but it gets quite complex.
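
A different technique from the child-element test above, shown only as a sketch: scan the channel's inner text for whole elements with a backreference to the opening tag name. Because scan consumes each match, everything inside item is swallowed as part of item and never reported as a top-level element. The helper names direct_children and extract_direct_tags are hypothetical, and the sketch assumes no self-closing tags, no element nested inside another of the same name, and tag names made only of word characters.

```ruby
# Hypothetical helper: maps tag name => inner text for each top-level
# element in inner_text. \1 forces the closing tag to match the opening one.
# If a tag repeats (e.g. several items), the last occurrence wins.
def direct_children(inner_text)
  inner_text.scan(/<(\w+)\b[^<>]*>(.*?)<\/\1>/m).to_h
end

# Hypothetical helper: keeps only the requested tags that are direct children.
def extract_direct_tags(inner_text, tags)
  children = direct_children(inner_text)
  tags.each_with_object({}) do |tag, result|
    result[tag] = children[tag].strip if children.key?(tag)
  end
end

# Inner text of the <channel> element from the question.
channel_text = <<~XML
  <title>Sunset Boulevard</title>
  <link>http://www.imdb.com/title/tt0043014/</link>
  <description>A hack screenwriter writes a screenplay..</description>
  <language>English</language>
  <item>
    <rating>8.6</rating>
  </item>
XML

result = extract_direct_tags(channel_text, %w[title link description language rating])
result.each { |tag, text| puts "#{tag}: #{text}" }
# 'rating' is absent from result because it only occurs inside item.
```

The same direct_children call can then be applied to the captured item text to read its own children, one level at a time.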
