I have a corpus of tens of thousands of small XML files, and I'm trying to use Python to extract the text contained in one of the XML tags. For example, I want everything between the body tags in something like:

<body> sample text here with <bold> nested </bold> tags in this paragraph </body>

I then want to write that string to a text file and move on to the next XML file in the list.

I'm using effbot's ElementTree but couldn't find the right commands or syntax to do this. I found a website that uses minidom's dom.getElementsByTagName, but I'm not sure what the corresponding method is in ElementTree. Any ideas would be greatly appreciated.

Martijn Pieters
Levar
  • I'd start with reading some tutorials then; the [Dive into Python 3 XML chapter](http://getpython3.com/diveintopython3/xml.html) would be a good start. – Martijn Pieters Jun 16 '12 at 16:10
  • In your example, do you also want to get the nested tags or only the text inside them? – Facundo Casco Jun 16 '12 at 16:11
  • And is there any other content outside of the `body` tag? – poke Jun 16 '12 at 16:14
  • This answer may also help http://stackoverflow.com/a/4624146/1290420 – daedalus Jun 16 '12 at 16:22
  • There is more content outside of the body tag, but I think that in all the XML files the body tag is always a child of the root tag. I only want to get the text in the body tag and none of the nested tags. Thanks for the links. I will try those. – Levar Jun 16 '12 at 16:36

2 Answers


A better answer, showing how to actually use XML parsing to do this:

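import xml.etree.ElementTree as ET

stringofxml = "<body> sample text here with <bold> nested </bold> tags in this paragraph </body>"

def extractTextFromElement(elementName, stringofxml):
    # fromstring() returns the root element (<body> for this sample string)
    root = ET.fromstring(stringofxml)
    # iter() walks the root and all of its descendants, so the match may be
    # the root itself, a direct child, or a more deeply nested element
    for element in root.iter(elementName):
        # .text holds only the text before the first child element and may be None
        return (element.text or "").strip()
    return None  # no element with that name was found

print(extractTextFromElement('bold', stringofxml))  # prints: nested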
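
Note that an element's .text stops at the first nested tag, so the text after </bold> (stored as that child's .tail) is not returned by the function above. The following is a minimal sketch of one way to collect the whole body text and to repeat that over a directory of files. The names body_text and convert_corpus, the include_nested flag, the whitespace collapsing, and the flat directory of *.xml files are all assumptions made for illustration, not part of the original answer.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

def body_text(xml_source, tag="body", include_nested=True):
    """Return the text inside the first <tag> element, without markup.

    Hypothetical helper: returns None if no such element exists.
    """
    root = ET.fromstring(xml_source)
    # iter() also yields the root, so <body> may be the root or any descendant.
    element = next(root.iter(tag), None)
    if element is None:
        return None
    if include_nested:
        # itertext() yields every text fragment in document order,
        # including text inside nested elements such as <bold>.
        parts = element.itertext()
    else:
        # Own text only: .text plus the .tail that follows each direct child.
        parts = [element.text or ""] + [child.tail or "" for child in element]
    # Collapse runs of whitespace into single spaces (an illustrative choice).
    return " ".join("".join(parts).split())

def convert_corpus(xml_dir, out_dir):
    """Write one .txt file per .xml file; return the names of skipped files."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    skipped = []
    for xml_path in sorted(Path(xml_dir).glob("*.xml")):
        try:
            # Passing bytes lets the parser honour the file's declared encoding.
            text = body_text(xml_path.read_bytes())
        except ET.ParseError:
            text = None
        if text is None:
            skipped.append(xml_path.name)
            continue
        (out / (xml_path.stem + ".txt")).write_text(text, encoding="utf-8")
    return skipped

sample = ("<doc><title>t</title><body> sample text here with "
          "<bold> nested </bold> tags in this paragraph </body></doc>")
print(body_text(sample))
print(body_text(sample, include_nested=False))
```

With include_nested=False the words inside <bold> are dropped but the text on either side of it is kept, which may be closer to what "none of the nested tags" means in the question; which of the two is wanted depends on the corpus.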
Hawkwing

I would just use re:

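import re
# body_txt initially holds the whole file's text; search() (unlike match()) finds
# <body> anywhere in it, and re.DOTALL lets '.' span line breaks
body_txt = re.search(r'<body>(.*)</body>', body_txt, re.DOTALL).group(1)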

then to remove the inner tags:

body_txt = re.sub('<.*?>','',body_txt)

You shouldn't use regexps when they aren't needed, it's true... but there's nothing wrong with using them when they are.
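
As a rough sketch, the two regex steps can be wrapped in one function. The name body_text_regex is hypothetical, and the handling of attributes on the body tag, the entity decoding with html.unescape, and the whitespace collapsing are additions for illustration rather than part of the answer above. It assumes each file contains a single, non-nested body element and no CDATA sections or comments containing angle brackets.

```python
import html
import re

# <body ...> with optional attributes, then the shortest run up to </body>.
BODY_RE = re.compile(r"<body\b[^>]*>(.*?)</body>", re.DOTALL)
# Any remaining tag, e.g. <bold> or </bold>.
TAG_RE = re.compile(r"<[^>]+>")

def body_text_regex(xml_text):
    """Return the body text with inner tags removed, or None if no body is found."""
    match = BODY_RE.search(xml_text)
    if match is None:
        return None
    without_tags = TAG_RE.sub("", match.group(1))
    # A regex leaves entities such as &amp; untouched, so decode them here.
    decoded = html.unescape(without_tags)
    # Collapse runs of whitespace into single spaces.
    return " ".join(decoded.split())

sample = ('<doc>\n<body class="x"> sample text with <bold> nested </bold>\n'
          ' tags &amp; more </body>\n</doc>')
print(body_text_regex(sample))
```

Unlike a parser, this will not notice malformed XML, so it may be worth spot-checking a few output files against their sources.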

Scruffy