2

I have a bunch of XML files (about 74k), and they all have this kind of structure:

<?xml version="1.0" encoding="UTF-8"?><article pmcid="2653499" pmid="19243591" doi="10.1186/1472-6963-9-38">
<title>Systematic review</title>
<fulltext>...</fulltext>
<figures>
<figure iri="1472-6963-9-38-2"><caption>...</caption></figure>
<figure iri="1472-6963-9-38-1"><caption>...</caption></figure>
</figures>
</article>

I'd like to relate the pmcid attribute (which is unique per file) to the iri attributes of the figures the file contains, as a list, so that I can build a numpy array from them or even a file that is easy to work with.

For instance, for this article the line should be:

2653499 1472-6963-9-38-2 1472-6963-9-38-1

I have tried XSLT without any success. I would appreciate any help.

alecxe
ssierral
  • Have you tried any of the existing XML parsing libraries? See http://stackoverflow.com/questions/1912434/how-do-i-parse-xml-in-python – rurouni88 Aug 13 '14 at 01:31

5 Answers

4

Here's an option using xml.etree.ElementTree from the standard library:

import xml.etree.ElementTree as ET

data = """<?xml version="1.0" encoding="UTF-8"?>
<article pmcid="2653499" pmid="19243591" doi="10.1186/1472-6963-9-38">
    <title>Systematic review</title>
    <fulltext>...</fulltext>
    <figures>
        <figure iri="1472-6963-9-38-2"><caption>...</caption></figure>
        <figure iri="1472-6963-9-38-1"><caption>...</caption></figure>
    </figures>
</article>
"""

article = ET.fromstring(data)

# The root element is <article>; pmcid is one of its attributes
pmcid = article.attrib.get('pmcid')
for figure in article.findall('figures/figure'):
    iri = figure.attrib.get('iri')
    print(pmcid, iri)

Prints:

2653499 1472-6963-9-38-2
2653499 1472-6963-9-38-1
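
To get the one-line-per-article format from the question across a whole directory, the loop above could be wrapped in two small functions. This is only a minimal sketch: article_to_line and write_index are hypothetical names, the flat directory of *.xml files is an assumption about how the 74k files are stored, and every article is assumed to carry a pmcid attribute.

```python
import glob
import os
import xml.etree.ElementTree as ET


def article_to_line(article):
    """Build 'pmcid iri iri ...' from an <article> root element."""
    pmcid = article.get('pmcid')  # assumed to be present on every article
    iris = [figure.get('iri') for figure in article.findall('figures/figure')]
    return ' '.join([pmcid] + iris)


def write_index(xml_dir, out_path):
    """Write one line per *.xml file found directly inside xml_dir."""
    with open(out_path, 'w') as out:
        for path in sorted(glob.glob(os.path.join(xml_dir, '*.xml'))):
            root = ET.parse(path).getroot()
            out.write(article_to_line(root) + '\n')
```

Calling write_index with the directory holding the files and an output path would then produce a plain text file with one article per line, which is easy to split on whitespace later.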
alecxe
2

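What about using BeautifulSoup?

from bs4 import BeautifulSoup

# The built-in "html.parser" copes with these lowercase tags;
# pass "xml" instead if lxml is installed.
with open('file.xml') as f:
    soup = BeautifulSoup(f, 'html.parser')

pmcid = soup.find('article')['pmcid']
figures = soup.find_all('figure')

# Print the pmcid followed by every figure iri on a single line
print(pmcid, ' '.join(fig['iri'] for fig in figures))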

This prints exactly the line from your example:

2653499 1472-6963-9-38-2 1472-6963-9-38-1
mnjeremiah
1

out.xsl:

<!-- http://www.w3.org/TR/xslt#copying -->
<!-- http://www.dpawson.co.uk/xsl/sect2/identity.html#d5917e43 -->
<!-- The Identity Transformation -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text" version="1.0" encoding="UTF-8"/>

    <!-- Whenever you match any node or any attribute -->
    <xsl:template match="@*|node()">
        <!-- Copy the current node -->
        <xsl:copy>
            <!-- Including any attributes it has and any child nodes -->
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="article">
        <xsl:value-of select="@pmcid"/>
        <xsl:apply-templates select="figures/figure"/>
        <xsl:text>
</xsl:text>
    </xsl:template>

    <xsl:template match="figure">
        <xsl:text> </xsl:text><xsl:value-of select="@iri"/>
    </xsl:template>
</xsl:stylesheet>

Run:

$ xsltproc out.xsl in.xml
2653499 1472-6963-9-38-2 1472-6963-9-38-1
Etan Reisner
0

You can try xmllint.

xmllint --shell myxml <<< 'cat /article/@pmcid|//figures/figure/@*'
/ >  -------
 pmcid="2653499"
 -------
 iri="1472-6963-9-38-2"
 -------
 iri="1472-6963-9-38-1"
/ >

Then pipe the result to awk to get the desired output:

xmllint --shell myxml <<< 'cat /article/@pmcid|//figures/figure/@*' |
awk -F'[="]' -v ORS=" " 'NF>1{print $3}'
2653499 1472-6963-9-38-2 1472-6963-9-38-1
jaypal singh
0

A)

Well, since you said ANY help, here's my shot.

From my experience, you're going to be much more satisfied prodding around with

obj.__dict__

and seeing how each XML element fits together. That way you effectively sanity-check the whole XML file by iterating over it, as in the session that follows.

I took your example data, placed it in an .xml file, and loaded it in a Python IDE (2.7.xxx). Here's how I worked out what code to use:

>>> import xml.etree.ElementTree as ET
>>> some_tree = ET.parse("/Users/pro/Desktop/tech/test_scripts/test.xml")
>>> for block_number in range(0, len(some_tree._root.getchildren())):
    print "block_number: " + str(block_number)


block_number: 0
block_number: 1
block_number: 2
>>> some_tree._root.getchildren()
[<Element 'title' at 0x101a59450>, <Element 'fulltext' at 0x101a59550>, <Element 'figures' at 0x101a59410>]
>>> some_tree._root.__dict__
{'text': '\n', 'attrib': {'pmid': '19243591', 'doi': '10.1186/1472-6963-9-38', 'pmcid': '2653499'}, 'tag': 'article', '_children': [<Element 'title' at 0x101a59450>, <Element 'fulltext' at 0x101a59550>, <Element 'figures' at 0x101a59410>]}
>>> some_tree._root.attrib
{'pmid': '19243591', 'doi': '10.1186/1472-6963-9-38', 'pmcid': '2653499'}
>>> some_tree._root.attrib['pmid']
'19243591'
>>> to_store = {}
>>> to_store[some_tree._root.attrib['pmid']] = []
>>> some_tree._root.getchildren()
[<Element 'title' at 0x101a59450>, <Element 'fulltext' at 0x101a59550>, <Element 'figures' at 0x101a59410>]
>>> some_tree._root[2]
<Element 'figures' at 0x101a59410>
>>> some_tree._root[2].__dict__
{'text': '\n', 'attrib': {}, 'tag': 'figures', 'tail': '\n', '_children': [<Element 'figure' at 0x101a595d0>, <Element 'figure' at 0x101a59650>]}
>>> some_tree._root[2].getchildren()
[<Element 'figure' at 0x101a595d0>, <Element 'figure' at 0x101a59650>]
>>> for r in range(0, len(some_tree._root[2].getchildren())):
    print some_tree._root[2].getchildren()[r]


<Element 'figure' at 0x101a595d0>
<Element 'figure' at 0x101a59650>
>>> some_tree._root[2].getchildren()[1].__dict__
{'attrib': {'iri': '1472-6963-9-38-1'}, 'tag': 'figure', 'tail': '\n', '_children': [<Element 'caption' at 0x101a59690>]}
>>> for r in range(0, len(some_tree._root[2].getchildren())):
    to_store[to_store.keys()[0]].append(some_tree._root[2].getchildren()[r].attrib['iri'])


>>> to_store
{'19243591': ['1472-6963-9-38-2', '1472-6963-9-38-1']}
>>> 

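Note that to_store is arbitrary, a mere convenience for however you want to store those pairs of data. The session above keys the dictionary on pmid; to get the mapping the question asks for, use attrib['pmcid'] instead.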
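
The same exploration can be condensed using only the public ElementTree API, which avoids _root and getchildren() (the latter was removed in Python 3.9). This is a minimal sketch rather than a drop-in solution: figures_by_pmcid is a hypothetical helper name, it keys on pmcid as the question requests, and it assumes the XML is already available as a string.

```python
import xml.etree.ElementTree as ET


def figures_by_pmcid(xml_text):
    """Map the article's pmcid to the list of its figure iri values."""
    root = ET.fromstring(xml_text)
    # iter('figure') walks the whole tree, so nesting depth does not matter
    iris = [figure.get('iri') for figure in root.iter('figure')]
    return {root.get('pmcid'): iris}


SAMPLE = """<article pmcid="2653499" pmid="19243591" doi="10.1186/1472-6963-9-38">
<title>Systematic review</title>
<fulltext>...</fulltext>
<figures>
<figure iri="1472-6963-9-38-2"><caption>...</caption></figure>
<figure iri="1472-6963-9-38-1"><caption>...</caption></figure>
</figures>
</article>"""

to_store = figures_by_pmcid(SAMPLE)
```

Merging the dictionaries returned for each file with to_store.update(...) would give one mapping for the whole collection.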

B)

I really like outputting to my own SQLite flat-file database. I did that when translating the entire Bible for use at runtime in an iOS app I released. Here's some example code for the SQL:

import sqlite3
bible_books = ["genesis", "exodus", "leviticus", "numbers", "deuteronomy",
           "joshua", "judges", "ruth", "1 samuel", "2 samuel", "1 kings",
           "2 kings", "1 chronicles", "2 chronicles", "ezra", "nehemiah",
           "esther", "job", "psalms", "proverbs", "ecclesiastes",
           "song of solomon", "isaiah", "jeremiah", "lamentations",
           "ezekiel", "daniel", "hosea", "joel", "amos", "obadiah",
           "jonah", "micah", "nahum", "habakkuk", "zephaniah", "haggai",
           "zechariah", "malachi", "matthew", "mark", "luke", "john",
           "acts", "romans", "1 corinthians", "2 corinthians",
           "galatians", "ephesians", "philippians", "colossians",
           "1 thessalonians", "2 thessalonians", "1 timothy",
           "2 timothy", "titus", "philemon", "hebrews", "james",
           "1 peter", "2 peter", "1 john", "2 john", "3 john",
           "jude", "revelation"]
chapter_counts = {bible_books[0]:50, bible_books[1]:40, bible_books[2]:27,
          bible_books[3]:36, bible_books[4]:34, bible_books[5]:24,
          bible_books[6]:21, bible_books[7]:4, bible_books[8]:31,
          bible_books[9]:24, bible_books[10]:22, bible_books[11]:25,
          bible_books[12]:29, bible_books[13]:36, bible_books[14]:10,
          bible_books[15]:13, bible_books[16]:10, bible_books[17]:42,
          bible_books[18]:150, bible_books[19]:31, bible_books[20]:12,
          bible_books[21]:8, bible_books[22]:66, bible_books[23]:52,
          bible_books[24]:5, bible_books[25]:48, bible_books[26]:12,
          bible_books[27]:14, bible_books[28]:3, bible_books[29]:9,
          bible_books[30]:1, bible_books[31]:4, bible_books[32]:7,
          bible_books[33]:3, bible_books[34]:3,
          bible_books[35]:3, bible_books[36]:2, bible_books[37]:14,
          bible_books[38]:4, bible_books[39]:28, bible_books[40]:16,
          bible_books[41]:24, bible_books[42]:21, bible_books[43]:28,
          bible_books[44]:16, bible_books[45]:16, bible_books[46]:13,
          bible_books[47]:6, bible_books[48]:6, bible_books[49]:4,
          bible_books[50]:4, bible_books[51]:5, bible_books[52]:3,
          bible_books[53]:6, bible_books[54]:4, bible_books[55]:3,
          bible_books[56]:1, bible_books[57]:13, bible_books[58]:5,
          bible_books[59]:5, bible_books[60]:3, bible_books[61]:5,
          bible_books[62]:1, bible_books[63]:1, bible_books[64]:1,
          bible_books[65]:22}

conn = sqlite3.connect("bible_web.sqlite3")
c = conn.cursor()



for i_book in bible_books:
    book_name = "b_" + i_book.lower().replace(" ", "_")
    for i_chapter in range(1, chapter_counts[i_book]+1):
        c.execute("create table " + book_name + "_" + str(i_chapter) + " (verse real primary key, value text)")

for i_book in bible_books:
    book_name = "b_" + i_book.lower().replace(" ", "_")
    for i_chapter in range(1, chapter_counts[i_book]+1):
        #c.execute("SELECT Count(*) FROM " + book_name + "_" + str(i_chapter))
        #i_rows = int(c.fetchall())
        #for verse_number in range(1, i_rows+1):
        c.execute("update " + book_name + "_" + str(i_chapter) + " set value=trim(value)")

conn.commit()
c.close()
conn.close()
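
Applied to the question's data, the same idea might look like the sketch below. It is an assumption-laden illustration, not part of the Bible project: the table name figure, its two columns, and the helper names store_figures and figures_for are all hypothetical, and each iri is assumed to be unique across the collection.

```python
import sqlite3


def store_figures(conn, pmcid, iris):
    """Insert one (pmcid, iri) row per figure into a single table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS figure "
        "(pmcid TEXT NOT NULL, iri TEXT PRIMARY KEY)"
    )
    # Parameter placeholders avoid building SQL by string concatenation
    conn.executemany(
        "INSERT OR IGNORE INTO figure (pmcid, iri) VALUES (?, ?)",
        [(pmcid, iri) for iri in iris],
    )
    conn.commit()


def figures_for(conn, pmcid):
    """Return the iri values stored for one article, in insertion order."""
    rows = conn.execute(
        "SELECT iri FROM figure WHERE pmcid = ? ORDER BY rowid", (pmcid,)
    )
    return [row[0] for row in rows]
```

One table with a row per figure keeps lookups in both directions simple, instead of creating a table per article.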

Just some ideas. Hope that helps.

jplego