I'm writing a simple RSS parser (I know many already exist) and I've run into a problem. Let's say I have the following RSS feed:
<channel>
  <title>Sunset Boulevard</title>
  <link>http://www.imdb.com/title/tt0043014/</link>
  <description>A hack screenwriter writes a screenplay..</description>
  <language>English</language>
  <item>
    <rating>8.6</rating>
  </item>
</channel>
I have a method that, given a tag and its subtags, extracts their contents into a simple hash. Here's my "method":
def extract_text_from_tag(text, tag)
  # Returns the inner text of the first <tag>...</tag> found in text, or '' if there is none.
  text =~ /<#{tag}.*?>(?<tag_text>.*?)<\/#{tag}>/m ? $~[:tag_text] : ''
end
To parse the channel, I first extract its text, and then, using an array of predefined tags (title, link, etc.), I extract their data. However, I want my regular expression to match only direct children of my tag.
For example, if I pass the 'title', 'link', 'description', 'language' and 'rating' tags here, I want to match all of them except 'rating' (because it's a child of item, not a direct child of channel).
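To make the goal concrete, here is a minimal sketch of one way the "direct children only" behaviour could be obtained. It does not use a single regular expression. Instead it walks over every tag in the channel's text and tracks the nesting depth. The names `extract_direct_children` and `TAG_RE` are hypothetical and only for illustration. The sketch assumes simple, well-formed markup with no CDATA sections or comments that contain tags.

```ruby
# The method from the post, repeated so the example runs on its own.
def extract_text_from_tag(text, tag)
  text =~ /<#{tag}.*?>(?<tag_text>.*?)<\/#{tag}>/m ? $~[:tag_text] : ''
end

# Hypothetical pattern: matches an opening, closing, or self-closing tag.
# Group 1: "/" for a closing tag, group 2: tag name, group 3: "/" if self-closing.
TAG_RE = /<(\/)?([\w:-]+)[^>]*?(\/)?>/m

# Hypothetical helper: returns a hash of tag name => text for those of the
# given tags that are direct children of the element whose inner text is passed in.
def extract_direct_children(inner_text, tags)
  result  = {}
  depth   = 0    # 0 means we are directly inside the parent element
  current = nil  # name of the direct child that is currently open
  start   = nil  # offset where that child's content begins

  inner_text.scan(TAG_RE) do |closing, name, self_closing|
    match = Regexp.last_match
    if closing
      depth -= 1
      if depth.zero? && name == current
        # A direct child just closed: keep its text only if it was requested.
        result[name] = inner_text[start...match.begin(0)] if tags.include?(name)
        current = nil
      end
    elsif !self_closing
      if depth.zero?
        current = name
        start   = match.end(0)
      end
      depth += 1
    end
  end

  result
end

feed = <<~XML
  <channel>
  <title>Sunset Boulevard</title>
  <link>http://www.imdb.com/title/tt0043014/</link>
  <description>A hack screenwriter writes a screenplay..</description>
  <language>English</language>
  <item>
  <rating>8.6</rating>
  </item>
  </channel>
XML

channel_text = extract_text_from_tag(feed, 'channel')
children = extract_direct_children(channel_text, %w[title link description language rating])
children.keys # => ["title", "link", "description", "language"]
```

Here 'rating' is skipped because it opens at depth 1, inside item, rather than at depth 0. A real XML parser would handle the edge cases this sketch ignores.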