I am searching through an OPML file that looks something like this. I want to pull out the outline text and the xmlUrl.

  <outline text="lol">
  <outline text="Discourse on the Otter" xmlUrl="http://discourseontheotter.tumblr.com/rss" htmlUrl="http://discourseontheotter.tumblr.com/"/>
  <outline text="fedoras of okc" xmlUrl="http://fedorasofokc.tumblr.com/rss" htmlUrl="http://fedorasofokc.tumblr.com/"/>
  </outline>

My function:

 import re
 rssName = 'outline text="(.*?)"'
 rssUrl =  'xmlUrl="(.*?)"'

 def rssSearch():
     doc = open('ttrss.txt')
     for line in doc:
         if "xmlUrl" in line:
             mName = re.search(rssName, line)
             mUrl = re.search(rssUrl, line)
             if mName is not None:
                 print mName.group()
                 print mUrl.group()

However, the printed output includes the attribute names and the quotes:

 outline text="fedoras of okc"
 xmlUrl="http://fedorasofokc.tumblr.com/rss"

What are the proper regular expressions for rssName and rssUrl so that I get only the string between the quotes?

jumbopap
  • Quite unrelated to your question, but maybe still helpful: You can pre-compile a regular expression to save some nanoseconds of execution time. Use `rssName = re.compile('outline text="(.*?)"')` and `mName = rssName.search(line)`. – Kijewski Apr 24 '13 at 20:29
  • Why do you want to do this via regex? It's not the right tool. Use an xml parser, there are several in the standard library. – Daniel Roseman Apr 24 '13 at 20:31
  • In relation to @DanielRoseman's suggestion, if you want something easy to use that includes the kitchen sink, have a look at BeautifulStoneSoup, the XML parsing component of the Beautiful Soup library. – Endophage Apr 24 '13 at 20:34

2 Answers

Don't use regular expressions to parse XML. The code is messy, and there are too many things that can go wrong.

For example, what if your OPML provider happens to reformat their output like this:

<outline text="lol">
  <outline
      htmlUrl="http://discourseontheotter.tumblr.com/"
      xmlUrl="http://discourseontheotter.tumblr.com/rss"
      text="Discourse on the Otter"
  />
  <outline
      htmlUrl="http://fedorasofokc.tumblr.com/"
      xmlUrl="http://fedorasofokc.tumblr.com/rss"
      text="fedoras of okc"
  />
</outline>

That's perfectly valid, and it means exactly the same thing. But the line-oriented search and regular expressions like 'outline text="(.*?)"' will break.
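To make the failure concrete, here is a small sketch that runs the question's line-by-line search over one reformatted element. The variable names (`rss_name`, `reformatted`, `names`) are mine, the OPML is held in a string instead of being read from a file, and it is written so it also runs on Python 3:

```python
import re

# The pattern from the question, unchanged.
rss_name = 'outline text="(.*?)"'

# One feed from the reformatted OPML, held in a string for the demo.
reformatted = '''<outline text="lol">
  <outline
      htmlUrl="http://discourseontheotter.tumblr.com/"
      xmlUrl="http://discourseontheotter.tumblr.com/rss"
      text="Discourse on the Otter"
  />
</outline>'''

names = []
for line in reformatted.splitlines():
    if "xmlUrl" in line:
        # The xmlUrl line no longer contains 'outline text="...".
        m = re.search(rss_name, line)
        names.append(m.group(1) if m else None)

print(names)  # [None]: the feed name is never found
```

In the question's function the same situation skips the feed silently, because `mName` is `None` on the only line that contains `xmlUrl`.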

Instead, use an XML parser. Your code will be cleaner, simpler, and more reliable:

import xml.etree.cElementTree as ET

root = ET.parse('ttrss.txt').getroot()
for outline in root.iter('outline'):
    text = outline.get('text')
    xmlUrl = outline.get('xmlUrl')
    if text and xmlUrl:
        print text
        print xmlUrl

This handles both your OPML snippet and similar OPML files I found on the web like this political science list. And it's very simple with nothing tricky about it. (I'm not bragging, that's just the benefit you get from using an XML parser instead of regular expressions.)
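If you want to check the layout independence yourself, the following sketch parses both formattings from strings and compares the results. `feed_pairs` is a helper name made up for this example, and it uses `ET.fromstring` with the plain `xml.etree.ElementTree` module (rather than `ET.parse` on a file) so that it is self-contained and runs on Python 3:

```python
import xml.etree.ElementTree as ET

def feed_pairs(opml_text):
    """Return (text, xmlUrl) pairs for every outline element that has both."""
    root = ET.fromstring(opml_text)
    pairs = []
    for outline in root.iter('outline'):
        text = outline.get('text')
        xml_url = outline.get('xmlUrl')
        if text and xml_url:
            pairs.append((text, xml_url))
    return pairs

# Same data, two layouts: attributes on one line vs. one attribute per line.
one_line = '''<outline text="lol">
  <outline text="fedoras of okc" xmlUrl="http://fedorasofokc.tumblr.com/rss" htmlUrl="http://fedorasofokc.tumblr.com/"/>
</outline>'''

multi_line = '''<outline text="lol">
  <outline
      htmlUrl="http://fedorasofokc.tumblr.com/"
      xmlUrl="http://fedorasofokc.tumblr.com/rss"
      text="fedoras of okc"
  />
</outline>'''

print(feed_pairs(one_line))
print(feed_pairs(multi_line))
```

Both calls give the same single pair. The enclosing `lol` outline is skipped because it has no `xmlUrl` attribute.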

Michael Geary

Try `group(1)` instead of `group()`:

print mName.group(1)
print mUrl.group(1)

http://docs.python.org/2/library/re.html#re.MatchObject.group

If a groupN argument is zero, the corresponding return value is the entire matching string; if it is in the inclusive range [1..99], it is the string matching the corresponding parenthesized group.
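A quick illustration of the difference, using one line from the question's sample (the variable names are mine, and the `print()` calls are written so the snippet also runs on Python 3):

```python
import re

line = ('<outline text="fedoras of okc" '
        'xmlUrl="http://fedorasofokc.tumblr.com/rss" '
        'htmlUrl="http://fedorasofokc.tumblr.com/"/>')

m = re.search('outline text="(.*?)"', line)

whole = m.group()   # same as m.group(0): the entire match
inner = m.group(1)  # only the first parenthesized group

print(whole)  # outline text="fedoras of okc"
print(inner)  # fedoras of okc
```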

Or use a named group:

rssName = 'outline text="(?P<text>.*?)"'

and then

print mName.group('text')
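Put together, a minimal sketch with named groups for both values might look like this. The group name `url` and the variable names are just my choices, the patterns are pre-compiled as suggested in the comments on the question, and both matches are checked before use so a line with only one of the attributes does not raise an `AttributeError`:

```python
import re

rss_name = re.compile('outline text="(?P<text>.*?)"')
rss_url = re.compile('xmlUrl="(?P<url>.*?)"')

line = ('<outline text="fedoras of okc" '
        'xmlUrl="http://fedorasofokc.tumblr.com/rss" '
        'htmlUrl="http://fedorasofokc.tumblr.com/"/>')

name = url = None
m_name = rss_name.search(line)
m_url = rss_url.search(line)
if m_name and m_url:
    # Named groups are looked up by name instead of by position.
    name = m_name.group('text')
    url = m_url.group('url')

print(name)  # fedoras of okc
print(url)   # http://fedorasofokc.tumblr.com/rss
```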
nacholibre