
I have this example XML file:

<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
  <title>Chapter 2</title>
  <content>Welcome to Chapter 2</content>
</page>

I'd like to extract the contents of the title and content tags.

Which method is better for extracting the data: pattern matching or the xml module? Or is there a better way to extract the data?

Sudeep

6 Answers


There is already a built-in XML library, notably ElementTree. For example:

>>> from xml.etree import cElementTree as ET
>>> xmlstr = """
... <root>
... <page>
...   <title>Chapter 1</title>
...   <content>Welcome to Chapter 1</content>
... </page>
... <page>
...  <title>Chapter 2</title>
...  <content>Welcome to Chapter 2</content>
... </page>
... </root>
... """
>>> root = ET.fromstring(xmlstr)
>>> for page in root:
...     title = page.find('title').text
...     content = page.find('content').text
...     print('title: %s; content: %s' % (title, content))
...
title: Chapter 1; content: Welcome to Chapter 1
title: Chapter 2; content: Welcome to Chapter 2
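If you only need particular tags, `findall` also accepts a limited XPath subset, so every title or content can be collected in one pass; a minimal sketch reusing the wrapped sample above:

```python
from xml.etree import ElementTree as ET

xmlstr = """
<root>
<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
  <title>Chapter 2</title>
  <content>Welcome to Chapter 2</content>
</page>
</root>
"""

root = ET.fromstring(xmlstr)

# './/tag' matches elements at any depth below the current node
titles = [el.text for el in root.findall('.//title')]
contents = [el.text for el in root.findall('.//content')]

print(titles)    # ['Chapter 1', 'Chapter 2']
print(contents)  # ['Welcome to Chapter 1', 'Welcome to Chapter 2']
```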
Santa
  • @SudeepKodavati: If you think Santa has answered the question to your satisfaction, please "accept" his answer. – MattH Oct 07 '11 at 19:46
  • I like this interface, you can index into child tags `root[0][1][0]...`, as well as get an iterator from any node that will walk all child nodes! `list(root[0][1].itertext())` Super handy! – ThorSummoner Apr 24 '16 at 06:38
  • `cElementTree` is no longer needed on supported versions of Python (3.3+), use `ElementTree`. – Gringo Suave Mar 27 '22 at 19:25

Code :

from xml.etree import ElementTree as ET

tree = ET.parse("test.xml")
root = tree.getroot()

for page in root.findall('page'):
    print("Title: ", page.find('title').text)
    print("Content: ", page.find('content').text)

Output:

Title:  Chapter 1
Content:  Welcome to Chapter 1
Title:  Chapter 2
Content:  Welcome to Chapter 2
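For files too large to parse in one go, ElementTree's `iterparse` processes elements as the stream is read; a sketch under the same page structure, using an in-memory buffer in place of a real file:

```python
import io
from xml.etree import ElementTree as ET

xmlfile = io.StringIO("""<root>
<page><title>Chapter 1</title><content>Welcome to Chapter 1</content></page>
<page><title>Chapter 2</title><content>Welcome to Chapter 2</content></page>
</root>""")

pages = []
# 'end' events fire once an element (and all of its children) is complete
for event, elem in ET.iterparse(xmlfile, events=('end',)):
    if elem.tag == 'page':
        pages.append((elem.find('title').text, elem.find('content').text))
        elem.clear()  # drop the processed subtree to keep memory use flat

print(pages)
```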
Sashini Hettiarachchi

You can also try this code to extract texts:

from bs4 import BeautifulSoup

data ="""<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
 <title>Chapter 2</title>
 <content>Welcome to Chapter 2</content>
</page>"""

soup = BeautifulSoup(data, "html.parser")

# collect the text of every <title> and every <content> tag
titles = [tag.get_text() for tag in soup.find_all("title")]
contents = [tag.get_text() for tag in soup.find_all("content")]

for pair in zip(titles, contents):
    print(pair)

Output:

('Chapter 1', 'Welcome to Chapter 1')
('Chapter 2', 'Welcome to Chapter 2')
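Since the pairs are already zipped, writing them out as CSV takes only a few more lines; a sketch with the stdlib csv module and an in-memory buffer (swap in `open('out.csv', 'w', newline='')` for a real file):

```python
import csv
import io

# the (title, content) pairs produced above
pairs = [('Chapter 1', 'Welcome to Chapter 1'),
         ('Chapter 2', 'Welcome to Chapter 2')]

buf = io.StringIO()  # stands in for a real file handle
writer = csv.writer(buf)
writer.writerow(['title', 'content'])  # header row
writer.writerows(pairs)

print(buf.getvalue())
```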
Ashok Kumar Jayaraman

I personally prefer parsing with xml.dom.minidom, like so:

In [18]: import xml.dom.minidom

In [19]: x = """\
<root><page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
 <title>Chapter 2</title>
 <content>Welcome to Chapter 2</content>
</page></root>"""

In [28]: doc = xml.dom.minidom.parseString(x)
In [29]: doc.getElementsByTagName("page")
Out[29]: [<DOM Element: page at 0x94d5acc>, <DOM Element: page at 0x94d5c8c>]

In [30]: [p.firstChild.wholeText for p in doc.getElementsByTagName("title") if p.firstChild.nodeType == p.TEXT_NODE]
Out[30]: ['Chapter 1', 'Chapter 2']

In [31]: [p.firstChild.wholeText for p in doc.getElementsByTagName("content") if p.firstChild.nodeType == p.TEXT_NODE]
Out[31]: ['Welcome to Chapter 1', 'Welcome to Chapter 2']

In [36]: for node in doc.childNodes:                 # the <root> element
             for page in node.childNodes:            # <page> elements (plus whitespace)
                 if page.hasChildNodes():
                     for field in page.childNodes:   # <title> / <content>
                         if field.hasChildNodes():
                             for text_node in field.childNodes:
                                 if text_node.nodeType == text_node.TEXT_NODE:
                                     print(text_node.wholeText)
Chapter 1
Welcome to Chapter 1
Chapter 2
Welcome to Chapter 2
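Instead of hard-coding one nested loop per level, the same walk can be written as a small recursive generator that yields every non-blank text node at any depth; a sketch:

```python
import xml.dom.minidom

def iter_text(node):
    """Yield the stripped text of every non-blank text node under `node`."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            text = child.wholeText.strip()
            if text:  # skip whitespace-only nodes between elements
                yield text
        else:
            yield from iter_text(child)  # recurse into element nodes

doc = xml.dom.minidom.parseString(
    "<root><page><title>Chapter 1</title>"
    "<content>Welcome to Chapter 1</content></page></root>")

print(list(iter_text(doc)))  # ['Chapter 1', 'Welcome to Chapter 1']
```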
Andrew Stromme
chown

For working with (navigating, searching, and modifying) XML or HTML data, I found the BeautifulSoup library very useful. For installation instructions and more detail, see its documentation.

To extract tag text or attribute values:

from bs4 import BeautifulSoup
data = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">

<pdf2xml producer="poppler" version="0.48.0">
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="246" left="135" width="178" height="16" font="1">PALS SOCIETY OF CANADA</text>
<text top="261" width="86" height="16" font="1">13479 77 AVE</text>
</page>
</pdf2xml>"""

soup = BeautifulSoup(data, features="xml")
page_tag = soup.find_all('page')
for each_page in page_tag:
    text_tag = each_page.find_all('text')
    for text_data in text_tag:
        print("Text : ", text_data.text)
        print("Left attribute : ", text_data.get("left"))

Output:

Text :  PALS SOCIETY OF CANADA
Left attribute :  135
Text :  13479 77 AVE
Left attribute :  None
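If installing BeautifulSoup is not an option, the same attribute lookup works with the stdlib ElementTree; a sketch on a trimmed version of the sample above:

```python
from xml.etree import ElementTree as ET

data = """<pdf2xml producer="poppler" version="0.48.0">
<page number="1">
<text top="246" left="135" font="1">PALS SOCIETY OF CANADA</text>
<text top="261" font="1">13479 77 AVE</text>
</page>
</pdf2xml>"""

root = ET.fromstring(data)
# .get returns the attribute value, or None when the attribute is absent
results = [(el.text, el.get('left')) for el in root.iter('text')]
print(results)
```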

I'd recommend a simple library; here's an example: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples

from simplified_scrapy.simplified_doc import SimplifiedDoc

html = '''
<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
  <title>Chapter 2</title>
  <content>Welcome to Chapter 2</content>
</page>'''

doc = SimplifiedDoc(html)
pages = doc.pages
print([(page.title.text, page.content.text) for page in pages])

Result:

[('Chapter 1', 'Welcome to Chapter 1'), ('Chapter 2', 'Welcome to Chapter 2')]
dabingsou