
Consider this input:

input = """Yesterday<person>Peter</person>drove to<location>New York</location>"""

How can I use regex patterns to extract the following?

person: Peter
location: New York

This works, but I don't want to hard-code the tags because they can change:

import re

print(re.findall("<person>(.*?)</person>", input))
print(re.findall("<location>(.*?)</location>", input))
DevEx

2 Answers


Use a tool designed for the job. I happen to like lxml, but there are others.

>>> minput = """Yesterday<person>Peter Smith</person>drove to<location>New York</location>"""
>>> from lxml import html
>>> tree = html.fromstring(minput)
>>> for e in tree.iter():
        print(e, e.tag, e.text_content())
        if e.tag == 'person':            # tag is an attribute, not a method; last name per comment
           last = e.text_content().split()[-1]
           print(last)


<Element p at 0x3118ca8> p YesterdayPeter Smithdrove toNew York
<Element person at 0x3118b48> person Peter Smith
Smith                                            # here is the last name
<Element location at 0x3118ba0> location New York

If you are new to Python, you might want to visit this site to get an installer for a number of packages, including lxml.
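If installing a third-party package is not an option, the same idea can be sketched with the standard library's `html.parser` module in Python 3. This is not the lxml approach shown above, only a minimal stand-in: the class name `TagTextCollector` and the helper `extract_tagged` are made up for illustration, and the sketch assumes the tags are flat (not nested) as in the question's input. It collects every tag it meets, so no tag name is hard-coded.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect (tag, text) pairs for every tag, whatever its name is."""

    def __init__(self):
        super().__init__()
        self.current = None   # name of the tag we are currently inside, if any
        self.pairs = []       # collected (tag, text) tuples, in document order

    def handle_starttag(self, tag, attrs):
        self.current = tag

    def handle_endtag(self, tag):
        # Assumption: tags are not nested, so leaving a tag means "outside".
        self.current = None

    def handle_data(self, data):
        # Only keep text that sits inside a tag.
        if self.current is not None:
            self.pairs.append((self.current, data))


def extract_tagged(text):
    """Hypothetical helper: return a list of (tag, text) pairs found in text."""
    collector = TagTextCollector()
    collector.feed(text)
    collector.close()
    return collector.pairs


minput = "Yesterday<person>Peter</person>drove to<location>New York</location>"
for tag, value in extract_tagged(minput):
    print("%s: %s" % (tag, value))
# person: Peter
# location: New York
```

Nested tags would need a stack instead of the single `current` attribute.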

PyNEwbie
  • Thanks @PyNEwbie. In the case of `Peter Smith`, how can I use `text_content()` to extract only `Smith`? – DevEx Mar 24 '14 at 20:42
  • You can't, but you can split the string once you have it. – PyNEwbie Mar 24 '14 at 21:16
  • @PyNEwbie, is there any similar method I can use to get the words that are not inside the tags: `"Yesterday"` and `"drove to"`? – DevEx Mar 26 '14 at 13:52
  • Precisely, I want to do something like: Yesterday _person_ Smith drove to _location_ New York. Thanks @PyNEwbie for helping with lxml to parse the trees – DevEx Mar 26 '14 at 14:29
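One possible way to get the kind of output asked for in the last two comments, sketched with the standard library's `html.parser` (Python 3) rather than lxml. The names `PlaceholderBuilder` and `mark_tags` are hypothetical. The sketch keeps the text outside the tags, writes each tag as `_tag_`, and leaves the tagged text whole; trimming `Peter Smith` to `Smith` would still be a `split()[-1]` on that piece, as suggested above.

```python
from html.parser import HTMLParser


class PlaceholderBuilder(HTMLParser):
    """Rebuild the sentence, writing each opening tag as _tag_."""

    def __init__(self):
        super().__init__()
        self.parts = []   # pieces of the output sentence, in document order

    def handle_starttag(self, tag, attrs):
        # Replace the opening tag with a visible placeholder.
        self.parts.append("_%s_" % tag)

    def handle_data(self, data):
        # Called for text both inside and outside tags; skip blank runs.
        text = data.strip()
        if text:
            self.parts.append(text)


def mark_tags(text):
    """Hypothetical helper: return text with tags turned into placeholders."""
    builder = PlaceholderBuilder()
    builder.feed(text)
    builder.close()   # flush any text still buffered after the last tag
    return " ".join(builder.parts)


minput = "Yesterday<person>Peter Smith</person>drove to<location>New York</location>"
print(mark_tags(minput))
# Yesterday _person_ Peter Smith drove to _location_ New York
```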

Avoid parsing HTML with regex; use an HTML parser instead.

Here's an example using BeautifulSoup:

from bs4 import BeautifulSoup

data = "Yesterday<person>Peter</person>drove to<location>New York</location>"
soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly to avoid a warning

print('person: %s' % soup.person.text)
print('location: %s' % soup.location.text)

prints:

person: Peter
location: New York

Note the simplicity of the code.

Hope that helps.

alecxe