2

I have data in XML format. Example is shown as follow. I want to extract data from <text> tag. Here is my XML data.

    <text>
    The 40-Year-Old Virgin is a 2005 American buddy comedy
    film about a middle-aged man's journey to finally have sex.

    <h1>The plot</h1>
    Andy Stitzer (Steve Carell) is the eponymous 40-year-old virgin.
    <h1>Cast</h1>

    <h1>Soundtrack</h1>

    <h1>External Links</h1>
</text>

I need only The 40-Year-Old Virgin is a 2005 American buddy comedy film about a middle-aged man's journey to finally have sex. Is it possible? thanks

ObscureRobot
  • 7,306
  • 2
  • 27
  • 36
no_freedom
  • 1,963
  • 10
  • 30
  • 48
  • 4
    Don't use regex. Use [xml.etree.ElementTree](http://docs.python.org/library/xml.etree.elementtree.html). – jathanism Oct 20 '11 at 15:26
  • You're asking a lot of questions that are remarkably similar... – MattH Oct 20 '11 at 15:30
  • 1
    Obligatory XML-regex link http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – jasso Oct 20 '11 at 15:37

5 Answers5

4

Use an XML parser to parse XML. Using lxml:

import lxml.etree as ET

content='''\
<text>
    The 40-Year-Old Virgin is a 2005 American buddy comedy
    film about a middle-aged man's journey to finally have sex.

    <h1>The plot</h1>
    Andy Stitzer (Steve Carell) is the eponymous 40-year-old virgin.
    <h1>Cast</h1>

    <h1>Soundtrack</h1>

    <h1>External Links</h1>
</text>
'''

text=ET.fromstring(content)
print(text.text)

yields

    The 40-Year-Old Virgin is a 2005 American buddy comedy
    film about a middle-aged man's journey to finally have sex.
unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
2

Don't use regular expression to parse XML/HTML. Use a proper parser like BeautifulSoup or lxml in python.

Gabriel Ross
  • 5,168
  • 1
  • 28
  • 30
2

Whenever you find yourself looking at XML data and thinking about regular expressions, you should stop and ask yourself why you aren't considering a real XML parser. The structure of XML makes it perfectly suited for a proper parser, and maddeningly frustrating for regular expressions.

If you must use regular expressions, the following should do. Until your document changes!

import re
p = re.compile("<text>(.*)<h1>")
p.search(xml_text).group(1)

Spoiler: regular expressions might be appropriate if this is just a one-off problem that needs a quick and dirty solution. Or they might be appropriate if you know the input data will be fairly static and can't tolerate the overhead of a parser.

ObscureRobot
  • 7,306
  • 2
  • 27
  • 36
2

Here is how you could do this using ElementTree:

In [18]: import xml.etree.ElementTree as et

In [19]: t = et.parse('f.xml')

In [20]: print t.getroot().text.strip()
The 40-Year-Old Virgin is a 2005 American buddy comedy
    film about a middle-aged man's journey to finally have sex.
NPE
  • 486,780
  • 108
  • 951
  • 1,012
1

Here is an example using xml.etree.ElementTree:

>>> import xml.etree.ElementTree as ET
>>> data = """<text>
...     The 40-Year-Old Virgin is a 2005 American buddy comedy
...     film about a middle-aged man's journey to finally have sex.
... 
...     <h1>The plot</h1>
...     Andy Stitzer (Steve Carell) is the eponymous 40-year-old virgin.
...     <h1>Cast</h1>
... 
...     <h1>Soundtrack</h1>
... 
...     <h1>External Links</h1>
... </text>"""
>>> xml = ET.XML(data)
>>> xml.text
"\n    The 40-Year-Old Virgin is a 2005 American buddy comedy\n    film about a middle-aged man's journey to finally have sex.\n\n    "
>>> xml.text.strip().replace('\n   ', '')
"The 40-Year-Old Virgin is a 2005 American buddy comedy film about a middle-aged man's journey to finally have sex."

And there you go!

jathanism
  • 33,067
  • 9
  • 68
  • 86