Python parse XML and save as txt

Question

I have a folder of .xml files which look like this:

<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation Owner="NLM" Status="MEDLINE">
      <PMID Version="1">23458631</PMID>
      <DateCreated>
        <Year>2013</Year>
        <Month>04</Month>
        <Day>08</Day>
      </DateCreated>
      <MeshHeadingList>
        <MeshHeading>
          <DescriptorName MajorTopicYN="N">Animals</DescriptorName>
        </MeshHeading>
        <MeshHeading>
          <DescriptorName MajorTopicYN="N">Calcium</DescriptorName>
          <QualifierName MajorTopicYN="Y">metabolism</QualifierName>
        </MeshHeading>
        <MeshHeading>
          <DescriptorName MajorTopicYN="N">Calcium Chloride</DescriptorName>
          <QualifierName MajorTopicYN="N">administration &amp; dosage</QualifierName>
        </MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation Status="Publisher" Owner="NLM">
      <PMID Version="1">23458629</PMID>
      <DateCreated>
        <Year>2013</Year>
        <Month>3</Month>
        <Day>20</Day>
      </DateCreated>
      <MeshHeadingList>
        <MeshHeading>
          <DescriptorName MajorTopicYN="N">Adolescent</DescriptorName>
        </MeshHeading>
        <MeshHeading>
          <DescriptorName MajorTopicYN="N">Adult</DescriptorName>
        </MeshHeading>
        <MeshHeading>
          <DescriptorName MajorTopicYN="N">Anthropometry</DescriptorName>
        </MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>

I would like to use Python to parse the XML files and extract the PMID, DateCreated, and every DescriptorName with its MajorTopicYN attribute for each article, then save the result as a .txt file that looks like this:

ArticleID|CreatedDate|MeSH|IsMajor
23458631|20130408|Animals|N
23458631|20130408|Calcium|N
23458631|20130408|Calcium Chloride|N
23458629|20130320|Adolescent|N
23458629|20130320|Adult|N
23458629|20130320|Anthropometry|N

Have a look at http://stackoverflow.com/questions/1912434/how-do-i-parse-xml-in-python — shad0w_wa1k3r, Nov 04 '13 at 16:38

score 5 · Accepted Answer · answered Nov 04 '13 at 17:21

Here you go.

import xml.etree.ElementTree as ET

tree = ET.parse('data.xml')
root = tree.getroot()

# Open the output file once and keep it open while writing every row.
with open('my_text_file.txt', 'w') as f:
    f.write('ArticleID|CreatedDate|MeSH|IsMajor\n')
    for pubmed_article in root.findall('PubmedArticle'):
        citation = pubmed_article.find('MedlineCitation')
        ArticleID = citation.find('PMID').text
        date = citation.find('DateCreated')
        year = date.find('Year').text
        # Zero-pad month and day so that '3' becomes '03' (20130320, not 2013320).
        month = date.find('Month').text.zfill(2)
        day = date.find('Day').text.zfill(2)
        CreatedDate = year + month + day
        for mesh_heading in citation.find('MeshHeadingList').findall('MeshHeading'):
            descriptor = mesh_heading.find('DescriptorName')
            MeSH = descriptor.text
            IsMajor = descriptor.get('MajorTopicYN')
            f.write(ArticleID + '|' + CreatedDate + '|' + MeSH + '|' + IsMajor + '\n')

Here is the output file:

ArticleID|CreatedDate|MeSH|IsMajor
23458631|20130408|Animals|N
23458631|20130408|Calcium|N
23458631|20130408|Calcium Chloride|N
23458629|20130320|Adolescent|N
23458629|20130320|Adult|N
23458629|20130320|Anthropometry|N
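
The chained .find() calls above assume that every article contains every element. find() returns None when an element is absent, so an article without a MeshHeadingList would raise an AttributeError. Below is a minimal sketch of a more tolerant variant, assuming such articles should simply produce no rows and that PMID and DateCreated are always present. The helper name rows_for_article and the inline sample are hypothetical, made up for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the second article has no MeshHeadingList.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">23458631</PMID>
      <DateCreated><Year>2013</Year><Month>04</Month><Day>08</Day></DateCreated>
      <MeshHeadingList>
        <MeshHeading>
          <DescriptorName MajorTopicYN="N">Animals</DescriptorName>
        </MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">23458629</PMID>
      <DateCreated><Year>2013</Year><Month>3</Month><Day>20</Day></DateCreated>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def rows_for_article(article):
    """Return the pipe-delimited rows for one PubmedArticle element."""
    # Assumption: PMID and DateCreated exist in every article.
    pmid = article.findtext('MedlineCitation/PMID')
    created = (article.findtext('MedlineCitation/DateCreated/Year')
               + article.findtext('MedlineCitation/DateCreated/Month').zfill(2)
               + article.findtext('MedlineCitation/DateCreated/Day').zfill(2))
    rows = []
    # findall() returns an empty list when nothing matches, so an article
    # without a MeshHeadingList is skipped instead of raising an error.
    path = 'MedlineCitation/MeshHeadingList/MeshHeading/DescriptorName'
    for descriptor in article.findall(path):
        rows.append('|'.join([pmid, created, descriptor.text,
                              descriptor.get('MajorTopicYN', '')]))
    return rows


root = ET.fromstring(SAMPLE)
all_rows = [row
            for article in root.findall('PubmedArticle')
            for row in rows_for_article(article)]
print(all_rows)
```

Using one full path with findall() replaces the chain of find() calls, which is where the None would otherwise appear.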

Have you changed the input file at all? As far as I can see this code will result in some of the dates showing as 2013320 rather than 20130320. — ChrisProsser, Nov 04 '13 at 17:46
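
The padding issue raised in the comment above comes down to one-digit Month or Day values such as <Month>3</Month> in the second article. Here is a minimal sketch of the difference, using str.zfill. The function name created_date is a hypothetical helper for illustration:

```python
def created_date(year, month, day):
    """Join date parts as YYYYMMDD, zero-padding month and day to two digits."""
    return year + month.zfill(2) + day.zfill(2)


# Plain concatenation keeps the one-digit month as-is.
print('2013' + '3' + '20')               # 2013320
# zfill(2) pads it and leaves already padded values unchanged.
print(created_date('2013', '3', '20'))   # 20130320
print(created_date('2013', '04', '08'))  # 20130408
```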

score 1 · Answer 2 · answered Nov 04 '13 at 16:57

Use ElementTree: http://docs.python.org/2/library/xml.etree.elementtree.html

import xml.etree.ElementTree as ET
tree = ET.parse('data.xml')
root = tree.getroot()
for pmid in root.iter('PMID'):
    print(pmid.text)

The output for this is:

23458631
23458629

Once you have your element values you can build the strings and write them to a file.
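
As one possible way to carry out that last step, the sketch below collects the values for each article into a list and joins them with '|' before writing. It parses a small inline sample instead of data.xml so that it runs on its own. The output name pmids.txt and the temporary-directory location are placeholders, not part of the original answer:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Inline sample standing in for data.xml (trimmed from the question).
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">23458631</PMID>
      <DateCreated><Year>2013</Year><Month>04</Month><Day>08</Day></DateCreated>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">23458629</PMID>
      <DateCreated><Year>2013</Year><Month>3</Month><Day>20</Day></DateCreated>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

root = ET.fromstring(SAMPLE)

# Build one string per article; the header goes first.
lines = ['ArticleID|CreatedDate']
for citation in root.iter('MedlineCitation'):
    pmid = citation.find('PMID').text
    date = citation.find('DateCreated')
    created = (date.find('Year').text
               + date.find('Month').text.zfill(2)
               + date.find('Day').text.zfill(2))
    lines.append('|'.join([pmid, created]))

# Placeholder output location; any writable path would do.
out_path = os.path.join(tempfile.gettempdir(), 'pmids.txt')
with open(out_path, 'w') as f:
    f.write('\n'.join(lines) + '\n')
```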

score 0 · Answer 3 · answered Nov 04 '13 at 17:41

Here is my version:

import xml.etree.ElementTree as ET

xml_path = r'Y:\Misc\stack_overflow\Python\xml_extract\data.xml'
output_file_path = 'output.txt'

tree = ET.parse(xml_path)
root = tree.getroot()

# Text mode ('w') so that str rows can be written; the with block closes the file.
with open(output_file_path, 'w') as f:
    f.write('ArticleID|CreatedDate|MeSH|IsMajor\n')
    for pa in root.iter('PubmedArticle'):
        ArticleID = pa.find('MedlineCitation/PMID').text
        # zfill(2) pads one-digit months and days, e.g. '3' -> '03'.
        CreatedDate = (pa.find('MedlineCitation/DateCreated/Year').text
                       + pa.find('MedlineCitation/DateCreated/Month').text.zfill(2)
                       + pa.find('MedlineCitation/DateCreated/Day').text.zfill(2))
        for mh in pa.iter('MeshHeading'):
            DescriptorName = mh.find('DescriptorName').text
            MajorTopicYN = mh.find('DescriptorName').attrib['MajorTopicYN']
            f.write(ArticleID+'|'+CreatedDate+'|'+DescriptorName+'|'+MajorTopicYN+'\n')

The output in the file is:

ArticleID|CreatedDate|MeSH|IsMajor
23458631|20130408|Animals|N
23458631|20130408|Calcium|N
23458631|20130408|Calcium Chloride|N
23458629|20130320|Adolescent|N
23458629|20130320|Adult|N
23458629|20130320|Anthropometry|N
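
The question mentions a folder of .xml files, while the code above reads a single file. Below is a rough sketch of how the same loop might be applied to every file in a folder, using pathlib's glob. The function name export_folder, the file names a.xml and b.xml, and the tiny template are all made up for the demonstration, and it assumes every file has the same layout as the sample in the question:

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def export_folder(folder, output_path):
    """Write one pipe-delimited row per DescriptorName for every .xml file in folder."""
    with open(output_path, 'w') as out:
        out.write('ArticleID|CreatedDate|MeSH|IsMajor\n')
        # sorted() makes the file order, and so the row order, deterministic.
        for xml_file in sorted(Path(folder).glob('*.xml')):
            root = ET.parse(str(xml_file)).getroot()
            for pa in root.iter('PubmedArticle'):
                pmid = pa.find('MedlineCitation/PMID').text
                date = pa.find('MedlineCitation/DateCreated')
                created = (date.find('Year').text
                           + date.find('Month').text.zfill(2)
                           + date.find('Day').text.zfill(2))
                for dn in pa.iter('DescriptorName'):
                    out.write('|'.join([pmid, created, dn.text,
                                        dn.get('MajorTopicYN')]) + '\n')


# Demo with two tiny made-up files in a temporary folder.
TEMPLATE = (
    '<PubmedArticleSet><PubmedArticle><MedlineCitation>'
    '<PMID Version="1">{pmid}</PMID>'
    '<DateCreated><Year>2013</Year><Month>{month}</Month><Day>{day}</Day></DateCreated>'
    '<MeshHeadingList><MeshHeading>'
    '<DescriptorName MajorTopicYN="N">{mesh}</DescriptorName>'
    '</MeshHeading></MeshHeadingList>'
    '</MedlineCitation></PubmedArticle></PubmedArticleSet>'
)

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, 'a.xml').write_text(
        TEMPLATE.format(pmid='23458631', month='04', day='08', mesh='Animals'))
    Path(tmp, 'b.xml').write_text(
        TEMPLATE.format(pmid='23458629', month='3', day='20', mesh='Adult'))
    out_file = Path(tmp, 'output.txt')
    export_folder(tmp, out_file)
    result = out_file.read_text()

print(result)
```

Because the pattern is '*.xml', the output file can sit in the same folder without being picked up as input.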
