How to extract XML specific value fields and list them?

Question

I have a bunch of XML files (about 74k) with this kind of structure:

<?xml version="1.0" encoding="UTF-8"?><article pmcid="2653499" pmid="19243591" doi="10.1186/1472-6963-9-38">
<title>Systematic review</title>
<fulltext>...</fulltext>
<figures>
<figure iri="1472-6963-9-38-2"><caption>...</caption></figure>
<figure iri="1472-6963-9-38-1"><caption>...</caption></figure>
</figures>
</article>

I'd like to relate the pmcid attribute (which is unique per file) to the iri attributes of the figures each file contains, as a list, so I can build a numpy array from them or a file that is easy to work with.

For instance, for this article the line should be:

2653499 1472-6963-9-38-2 1472-6963-9-38-1

I have tried XSLT without any results. I would appreciate any help.

Have you tried any of the existing XML parsing libraries? See http://stackoverflow.com/questions/1912434/how-do-i-parse-xml-in-python – rurouni88, Aug 13 '14 at 01:31

score 4 · Answer 1 · answered Aug 13 '14 at 01:35

Here's an option using xml.etree.ElementTree from the standard library:

import xml.etree.ElementTree as ET

data = """<?xml version="1.0" encoding="UTF-8"?>
<article pmcid="2653499" pmid="19243591" doi="10.1186/1472-6963-9-38">
    <title>Systematic review</title>
    <fulltext>...</fulltext>
    <figures>
        <figure iri="1472-6963-9-38-2"><caption>...</caption></figure>
        <figure iri="1472-6963-9-38-1"><caption>...</caption></figure>
    </figures>
</article>
"""

article = ET.fromstring(data)

pmcid = article.attrib.get('pmcid')
for figure in article.findall('figures/figure'):
    iri = figure.attrib.get('iri')
    print pmcid, iri

Prints:

2653499 1472-6963-9-38-2
2653499 1472-6963-9-38-1
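
To get one line per file, as the question asks, the iri values can be collected before printing. The following is a minimal Python 3 sketch, not part of the original answer. It assumes every file has a pmcid attribute on its root element and that the files sit in a single directory with an .xml extension. The helper names article_line and lines_for_directory are made up for illustration.

```python
import glob
import os
import xml.etree.ElementTree as ET


def article_line(xml_text):
    """Return 'pmcid iri1 iri2 ...' for one article document."""
    article = ET.fromstring(xml_text)
    pmcid = article.get('pmcid')  # assumed to be present on the root element
    iris = [figure.get('iri') for figure in article.findall('figures/figure')]
    return ' '.join([pmcid] + iris)


def lines_for_directory(directory):
    """Yield one line per *.xml file in `directory` (hypothetical layout)."""
    for path in sorted(glob.glob(os.path.join(directory, '*.xml'))):
        # Read bytes so the parser honours the encoding declared in the file.
        with open(path, 'rb') as handle:
            yield article_line(handle.read())
```

Writing the yielded lines to a text file gives one row per article, which is straightforward to load again later.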

score 2 · Answer 2 · mnjeremiah · 2014-08-13T01:58:09.280

What about using BeautifulSoup?

from bs4 import BeautifulSoup

soup = BeautifulSoup(open('file.xml'))

pmcid = soup.find('article')['pmcid']
figure = soup.findAll('figure')

print pmcid,

for i in figure:
    print i['iri'],

This prints exactly the line from your example:

2653499 1472-6963-9-38-2 1472-6963-9-38-1

edited Aug 13 '14 at 01:58

answered Aug 13 '14 at 01:41

mnjeremiah


Thanks for sharing :) I'm going to try your way – jplego Aug 13 '14 at 02:37

score 1 · Accepted Answer · answered Aug 13 '14 at 01:53

out.xsl:

<!-- http://www.w3.org/TR/xslt#copying -->
<!-- http://www.dpawson.co.uk/xsl/sect2/identity.html#d5917e43 -->
<!-- The Identity Transformation -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text" version="1.0" encoding="UTF-8"/>

    <!-- Whenever you match any node or any attribute -->
    <xsl:template match="@*|node()">
        <!-- Copy the current node -->
        <xsl:copy>
            <!-- Including any attributes it has and any child nodes -->
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="article">
        <xsl:value-of select="@pmcid"/>
        <xsl:apply-templates select="figures/figure"/>
        <xsl:text>
</xsl:text>
    </xsl:template>

    <xsl:template match="figure">
        <xsl:text> </xsl:text><xsl:value-of select="@iri"/>
    </xsl:template>
</xsl:stylesheet>

Run:

$ xsltproc out.xsl in.xml
2653499 1472-6963-9-38-2 1472-6963-9-38-1

score 0 · Answer 4 · answered Aug 13 '14 at 01:49

You can try xmllint.

xmllint --shell myxml <<< `echo 'cat /article/@pmcid|//figures/figure/@*'`
/ >  -------
 pmcid="2653499"
 -------
 iri="1472-6963-9-38-2"
 -------
 iri="1472-6963-9-38-1"
/ >

Then pipe to awk to get the desired output:

xmllint --shell myxml <<< `echo 'cat /article/@pmcid|//figures/figure/@*'` | 
awk -F'[="]' -v ORS=" " 'NF>1{print $3}'
2653499 1472-6963-9-38-2 1472-6963-9-38-1
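
If the rest of the pipeline is in Python, the awk step could be replaced by a small regular expression over the captured xmllint output. This is only a sketch: the function name values_from_xmllint is hypothetical, and it assumes the output has the name="value" shape shown above, with double-quoted values.

```python
import re


def values_from_xmllint(output):
    """Pull the quoted attribute values out of `xmllint --shell` cat output."""
    # Attributes are printed as name="value"; the prompt lines ("/ >") and
    # the "-------" separators contain no quotes, so they never match.
    return re.findall(r'[\w:.-]+="([^"]*)"', output)
```

Joining the returned list with spaces gives the same single line as the awk command.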

score 0 · Answer 5 · answered Aug 13 '14 at 02:31

(A)

Well, since you said ANY help, here's my shot.

In my experience, you'll be much more satisfied poking around with

obj.__dict__

and seeing how each XML element fits. This way you effectively spell-check the entire XML file by running it through an iteration test (shown below).

I took your example data, placed it in an .xml file, and loaded it in a Python IDE (2.7.xxx). Here's how I worked out what code to use:

>>> import xml.etree.ElementTree as ET
>>> some_tree = ET.parse("/Users/pro/Desktop/tech/test_scripts/test.xml")
>>> for block_number in range(0, len(some_tree._root.getchildren())):
    print "block_number: " + str(block_number)


block_number: 0
block_number: 1
block_number: 2
>>> some_tree._root.getchildren()
[<Element 'title' at 0x101a59450>, <Element 'fulltext' at 0x101a59550>, <Element 'figures' at 0x101a59410>]
>>> some_tree._root.__dict__
{'text': '\n', 'attrib': {'pmid': '19243591', 'doi': '10.1186/1472-6963-9-38', 'pmcid': '2653499'}, 'tag': 'article', '_children': [<Element 'title' at 0x101a59450>, <Element 'fulltext' at 0x101a59550>, <Element 'figures' at 0x101a59410>]}
>>> some_tree._root.attrib
{'pmid': '19243591', 'doi': '10.1186/1472-6963-9-38', 'pmcid': '2653499'}
>>> some_tree._root.attrib['pmid']
'19243591'
>>> to_store = {}
>>> to_store[some_tree._root.attrib['pmid']] = []
>>> some_tree._root.getchildren()
[<Element 'title' at 0x101a59450>, <Element 'fulltext' at 0x101a59550>, <Element 'figures' at 0x101a59410>]
>>> some_tree._root[2]
<Element 'figures' at 0x101a59410>
>>> some_tree._root[2].__dict__
{'text': '\n', 'attrib': {}, 'tag': 'figures', 'tail': '\n', '_children': [<Element 'figure' at 0x101a595d0>, <Element 'figure' at 0x101a59650>]}
>>> some_tree._root[2].getchildren()
[<Element 'figure' at 0x101a595d0>, <Element 'figure' at 0x101a59650>]
>>> for r in range(0, len(some_tree._root[2].getchildren())):
    print some_tree._root[2].getchildren()[r]


<Element 'figure' at 0x101a595d0>
<Element 'figure' at 0x101a59650>
>>> some_tree._root[2].getchildren()[1].__dict__
{'attrib': {'iri': '1472-6963-9-38-1'}, 'tag': 'figure', 'tail': '\n', '_children': [<Element 'caption' at 0x101a59690>]}
>>> for r in range(0, len(some_tree._root[2].getchildren())):
    to_store[to_store.keys()[0]].append(some_tree._root[2].getchildren()[r].attrib['iri'])


>>> to_store
{'19243591': ['1472-6963-9-38-2', '1472-6963-9-38-1']}
>>>

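Note that to_store is arbitrary, a mere convenience for however you want to store those x,y pieces of data. The session keys the dictionary on pmid; to key on pmcid as the question asks, use attrib['pmcid'] instead.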
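
The session relies on the private _root attribute and on getchildren(), which was removed in Python 3.9, and Element objects in Python 3 may not expose __dict__ at all. The same exploration can be sketched with public attributes only. The function names describe and collect_figures are invented for this example, and the dictionary is keyed on pmcid, as the question asks, rather than pmid.

```python
import xml.etree.ElementTree as ET


def describe(element):
    """Summarise an element, much like the __dict__ dumps, via public attributes."""
    return {
        'tag': element.tag,
        'attrib': dict(element.attrib),
        'children': [child.tag for child in element],
    }


def collect_figures(root):
    """Map the article's pmcid to the iri values of its figure elements."""
    to_store = {root.attrib['pmcid']: []}
    for figure in root.iter('figure'):
        to_store[root.attrib['pmcid']].append(figure.attrib['iri'])
    return to_store
```

For a file on disk, the root element comes from ET.parse(path).getroot().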

(B)

I really like writing the output to my own SQLite flat-file database. I did that when translating the entire Bible for use at runtime in an iOS app I released. Here's some example code for the SQL:

import sqlite3
bible_books = ["genesis", "exodus", "leviticus", "numbers", "deuteronomy",
           "joshua", "judges", "ruth", "1 samuel", "2 samuel", "1 kings",
           "2 kings", "1 chronicles", "2 chronicles", "ezra", "nehemiah",
           "esther", "job", "psalms", "proverbs", "ecclesiastes",
           "song of solomon", "isaiah", "jeremiah", "lamentations",
           "ezekiel", "daniel", "hosea", "joel", "amos", "obadiah",
           "jonah", "micah", "nahum", "habakkuk", "zephaniah", "haggai",
           "zechariah", "malachi", "matthew", "mark", "luke", "john",
           "acts", "romans", "1 corinthians", "2 corinthians",
           "galatians", "ephesians", "philippians", "colossians",
           "1 thessalonians", "2 thessalonians", "1 timothy",
           "2 timothy", "titus", "philemon", "hebrews", "james",
           "1 peter", "2 peter", "1 john", "2 john", "3 john",
           "jude", "revelation"]
chapter_counts = {bible_books[0]:50, bible_books[1]:40, bible_books[2]:27,
          bible_books[3]:36, bible_books[4]:34, bible_books[5]:24,
          bible_books[6]:21, bible_books[7]:4, bible_books[8]:31,
          bible_books[9]:24, bible_books[10]:22, bible_books[11]:25,
          bible_books[12]:29, bible_books[13]:36, bible_books[14]:10,
          bible_books[15]:13, bible_books[16]:10, bible_books[17]:42,
          bible_books[18]:150, bible_books[19]:31, bible_books[20]:12,
          bible_books[21]:8, bible_books[22]:66, bible_books[23]:52,
          bible_books[24]:5, bible_books[25]:48, bible_books[26]:12,
          bible_books[27]:14, bible_books[28]:3, bible_books[29]:9,
          bible_books[30]:1, bible_books[31]:4, bible_books[32]:7,
          bible_books[33]:3, bible_books[34]:3,
          bible_books[35]:3, bible_books[36]:2, bible_books[37]:14,
          bible_books[38]:4, bible_books[39]:28, bible_books[40]:16,
          bible_books[41]:24, bible_books[42]:21, bible_books[43]:28,
          bible_books[44]:16, bible_books[45]:16, bible_books[46]:13,
          bible_books[47]:6, bible_books[48]:6, bible_books[49]:4,
          bible_books[50]:4, bible_books[51]:5, bible_books[52]:3,
          bible_books[53]:6, bible_books[54]:4, bible_books[55]:3,
          bible_books[56]:1, bible_books[57]:13, bible_books[58]:5,
          bible_books[59]:5, bible_books[60]:3, bible_books[61]:5,
          bible_books[62]:1, bible_books[63]:1, bible_books[64]:1,
          bible_books[65]:22}

conn = sqlite3.connect("bible_web.sqlite3")
c = conn.cursor()



for i_book in bible_books:
    book_name = "b_" + i_book.lower().replace(" ", "_")
    for i_chapter in range(1, chapter_counts[i_book]+1):
        c.execute("create table " + book_name + "_" + str(i_chapter) + " (verse real primary key, value text)")

for i_book in bible_books:
    book_name = "b_" + i_book.lower().replace(" ", "_")
    for i_chapter in range(1, chapter_counts[i_book]+1):
        #c.execute("SELECT Count(*) FROM " + book_name + "_" + str(i_chapter))
        #i_rows = int(c.fetchall())
        #for verse_number in range(1, i_rows+1):
        c.execute("update " + book_name + "_" + str(i_chapter) + " set value=trim(value)")

conn.commit()
c.close()
conn.close()
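
Applied to the data in the question, the same idea might look like the sketch below. It uses a single table with one row per (pmcid, iri) pair instead of one table per chapter, and passes values through ? placeholders rather than string concatenation. The table name figures, its column names, and the function names are assumptions made for illustration.

```python
import sqlite3


def store_figures(conn, pmcid, iris):
    """Insert one (pmcid, iri) row per figure; repeated pairs are ignored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS figures "
        "(pmcid TEXT, iri TEXT, PRIMARY KEY (pmcid, iri))"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO figures (pmcid, iri) VALUES (?, ?)",
        [(pmcid, iri) for iri in iris],
    )
    conn.commit()


def figures_for(conn, pmcid):
    """Return the iri values stored for one article, in insertion order."""
    rows = conn.execute(
        "SELECT iri FROM figures WHERE pmcid = ? ORDER BY rowid", (pmcid,)
    )
    return [row[0] for row in rows]
```

A connection from sqlite3.connect(":memory:") is enough to try it out before pointing it at a file.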

Just some ideas. Hope that helps.
