
I have this example XML file:

<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
  <title>Chapter 2</title>
  <content>Welcome to Chapter 2</content>
</page>

I'd like to extract the contents of the title and content tags.

Which method is better for extracting the data: pattern matching or the xml module? Or is there a better way to extract the data?

Sudeep

6 Answers


There is already a built-in XML library, notably ElementTree. For example:

>>> from xml.etree import cElementTree as ET
>>> xmlstr = """
... <root>
... <page>
...   <title>Chapter 1</title>
...   <content>Welcome to Chapter 1</content>
... </page>
... <page>
...  <title>Chapter 2</title>
...  <content>Welcome to Chapter 2</content>
... </page>
... </root>
... """
>>> root = ET.fromstring(xmlstr)
>>> for page in root:
...     title = page.find('title').text
...     content = page.find('content').text
...     print('title: %s; content: %s' % (title, content))
...
title: Chapter 1; content: Welcome to Chapter 1
title: Chapter 2; content: Welcome to Chapter 2
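If you only need particular tags, `findall` also accepts a limited XPath subset, so every title or content can be collected in one pass; a minimal sketch reusing the wrapped sample above:

```python
from xml.etree import ElementTree as ET

xmlstr = """
<root>
<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
  <title>Chapter 2</title>
  <content>Welcome to Chapter 2</content>
</page>
</root>
"""

root = ET.fromstring(xmlstr)

# './/tag' matches elements at any depth below the current node
titles = [el.text for el in root.findall('.//title')]
contents = [el.text for el in root.findall('.//content')]

print(titles)    # ['Chapter 1', 'Chapter 2']
print(contents)  # ['Welcome to Chapter 1', 'Welcome to Chapter 2']
```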
Santa
  • @SudeepKodavati: If you think Santa has answered the question to your satisfaction, please "accept" his answer. – MattH Oct 07 '11 at 19:46
  • I like this interface, you can index into child tags `root[0][1][0]...`, as well as get an iterator from any node that will walk all child nodes! `list(root[0][1].itertext())` Super handy! – ThorSummoner Apr 24 '16 at 06:38
  • `cElementTree` is no longer needed on supported versions of Python (3.3+), use `ElementTree`. – Gringo Suave Mar 27 '22 at 19:25

Code :

from xml.etree import ElementTree as ET

tree = ET.parse("test.xml")
root = tree.getroot()

for page in root.findall('page'):
    print("Title: ", page.find('title').text)
    print("Content: ", page.find('content').text)

Output:

Title:  Chapter 1
Content:  Welcome to Chapter 1
Title:  Chapter 2
Content:  Welcome to Chapter 2
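For files too large to parse in one go, ElementTree's `iterparse` processes elements as the stream is read; a sketch under the same page structure, using an in-memory buffer in place of a real file:

```python
import io
from xml.etree import ElementTree as ET

xmlfile = io.StringIO("""<root>
<page><title>Chapter 1</title><content>Welcome to Chapter 1</content></page>
<page><title>Chapter 2</title><content>Welcome to Chapter 2</content></page>
</root>""")

pages = []
# 'end' events fire once an element (and all of its children) is complete
for event, elem in ET.iterparse(xmlfile, events=('end',)):
    if elem.tag == 'page':
        pages.append((elem.find('title').text, elem.find('content').text))
        elem.clear()  # drop the processed subtree to keep memory use flat

print(pages)
```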
Sashini Hettiarachchi

You can also try this code to extract texts:

from bs4 import BeautifulSoup

data ="""<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
 <title>Chapter 2</title>
 <content>Welcome to Chapter 2</content>
</page>"""

soup = BeautifulSoup(data, "html.parser")

# collect the text of every <title> and every <content> tag
titles = [tag.get_text() for tag in soup.find_all("title")]
contents = [tag.get_text() for tag in soup.find_all("content")]

for pair in zip(titles, contents):
    print(pair)

Output:

('Chapter 1', 'Welcome to Chapter 1')
('Chapter 2', 'Welcome to Chapter 2')
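Since the pairs are already zipped, writing them out as CSV takes only a few more lines; a sketch with the stdlib csv module and an in-memory buffer (swap in `open('out.csv', 'w', newline='')` for a real file):

```python
import csv
import io

# the (title, content) pairs produced above
pairs = [('Chapter 1', 'Welcome to Chapter 1'),
         ('Chapter 2', 'Welcome to Chapter 2')]

buf = io.StringIO()  # stands in for a real file handle
writer = csv.writer(buf)
writer.writerow(['title', 'content'])  # header row
writer.writerows(pairs)

print(buf.getvalue())
```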
Ashok Kumar Jayaraman

I personally prefer parsing with xml.dom.minidom, like so:

In [18]: import xml.dom.minidom

In [19]: x = """\
<root><page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
 <title>Chapter 2</title>
 <content>Welcome to Chapter 2</content>
</page></root>"""

In [28]: doc = xml.dom.minidom.parseString(x)
In [29]: doc.getElementsByTagName("page")
Out[29]: [<DOM Element: page at 0x94d5acc>, <DOM Element: page at 0x94d5c8c>]

In [30]: [p.firstChild.wholeText for p in doc.getElementsByTagName("title") if p.firstChild.nodeType == p.TEXT_NODE]
Out[30]: ['Chapter 1', 'Chapter 2']

In [31]: [p.firstChild.wholeText for p in doc.getElementsByTagName("content") if p.firstChild.nodeType == p.TEXT_NODE]
Out[31]: ['Welcome to Chapter 1', 'Welcome to Chapter 2']

In [36]: for node in doc.childNodes:                 # the <root> element
             for page in node.childNodes:            # <page> elements (plus whitespace)
                 if page.hasChildNodes():
                     for field in page.childNodes:   # <title> / <content>
                         if field.hasChildNodes():
                             for text_node in field.childNodes:
                                 if text_node.nodeType == text_node.TEXT_NODE:
                                     print(text_node.wholeText)
Chapter 1
Welcome to Chapter 1
Chapter 2
Welcome to Chapter 2
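Instead of hard-coding one nested loop per level, the same walk can be written as a small recursive generator that yields every non-blank text node at any depth; a sketch:

```python
import xml.dom.minidom

def iter_text(node):
    """Yield the stripped text of every non-blank text node under `node`."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            text = child.wholeText.strip()
            if text:  # skip whitespace-only nodes between elements
                yield text
        else:
            yield from iter_text(child)  # recurse into element nodes

doc = xml.dom.minidom.parseString(
    "<root><page><title>Chapter 1</title>"
    "<content>Welcome to Chapter 1</content></page></root>")

print(list(iter_text(doc)))  # ['Chapter 1', 'Welcome to Chapter 1']
```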
Andrew Stromme
chown

For working with (navigating, searching, and modifying) XML or HTML data, I found the BeautifulSoup library very useful. For installation instructions and more detail, see its documentation.

To extract tag text or attribute values:

from bs4 import BeautifulSoup
data = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">

<pdf2xml producer="poppler" version="0.48.0">
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="246" left="135" width="178" height="16" font="1">PALS SOCIETY OF CANADA</text>
<text top="261" width="86" height="16" font="1">13479 77 AVE</text>
</page>
</pdf2xml>"""

soup = BeautifulSoup(data, features="xml")
page_tag = soup.find_all('page')
for each_page in page_tag:
    text_tag = each_page.find_all('text')
    for text_data in text_tag:
        print("Text : ", text_data.text)
        print("Left attribute : ", text_data.get("left"))

Output:

Text :  PALS SOCIETY OF CANADA
Left attribute :  135
Text :  13479 77 AVE
Left attribute :  None
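If installing BeautifulSoup is not an option, the same attribute lookup works with the stdlib ElementTree; a sketch on a trimmed version of the sample above:

```python
from xml.etree import ElementTree as ET

data = """<pdf2xml producer="poppler" version="0.48.0">
<page number="1">
<text top="246" left="135" font="1">PALS SOCIETY OF CANADA</text>
<text top="261" font="1">13479 77 AVE</text>
</page>
</pdf2xml>"""

root = ET.fromstring(data)
# .get returns the attribute value, or None when the attribute is absent
results = [(el.text, el.get('left')) for el in root.iter('text')]
print(results)
```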

I'd recommend a simple library; here's an example: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples

from simplified_scrapy.simplified_doc import SimplifiedDoc

html = '''
<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
  <title>Chapter 2</title>
  <content>Welcome to Chapter 2</content>
</page>'''

doc = SimplifiedDoc(html)
pages = doc.pages
print([(page.title.text, page.content.text) for page in pages])

Result:

[('Chapter 1', 'Welcome to Chapter 1'), ('Chapter 2', 'Welcome to Chapter 2')]
dabingsou