How to get all relevant fields from an XML file into a pandas dataframe in Python using xml.etree.ElementTree?

Question

I'm trying to parse an XML file from the Gene Expression Omnibus (GEO). I figured out how to get some of the data fields, but I can't figure out how to get fields like <Title>.

I tried adapting "How to convert an XML file to nice pandas dataframe?" but was only able to get some of the information.

How can I extract all of the available data into a pandas dataframe?

Here's an example of the XML file:

<Sample iid="GSM2978341">
    <Status database="GEO">
      <Submission-Date>2018-02-05</Submission-Date>
      <Release-Date>2019-03-25</Release-Date>
      <Last-Update-Date>2019-03-25</Last-Update-Date>
    </Status>
    <Title>PDD_P2_70</Title>
    <Accession database="GEO">GSM2978341</Accession>
    <Type>SRA</Type>
    <Channel-Count>1</Channel-Count>
    <Channel position="1">
      <Source>AZ-LolCDE</Source>
      <Organism taxid="679895">Escherichia coli BW25113</Organism>
      <Characteristics tag="strain">
BW25113
      </Characteristics>
      <Characteristics tag="type">
Gram-negative bacteria
      </Characteristics>
      <Characteristics tag="moa">
cell wall synthesis inhibitor / lipoprotein
      </Characteristics>
      <Characteristics tag="phenotype">
EC90 of phenotype
      </Characteristics>
      <Characteristics tag="treatment time">
~ 25 min
      </Characteristics>
      <Characteristics tag="treatment concentration">
200 uM
      </Characteristics>
      <Treatment-Protocol>
bacteria were treated with different antibiotics for ~ 25 min till  ~OD 0.2  in 2 ml tubes
      </Treatment-Protocol>
      <Growth-Protocol>
bacteria were grown in iso-sensitest medium
      </Growth-Protocol>
      <Molecule>total RNA</Molecule>
      <Extract-Protocol>
after treament bacteria were resuspended in QiaGen RNAprotect Bacteria Reagent (QiaGen #76506), incubated for 5min, centrifuged, and flash frozen on dry ice. Total RNA was extracted by incubating bacteria in Enzymatic Lysis Buffer  (lysozyme &amp; proteinase K) for 5 min followed by addition of QiaGen RLT Lysis Buffer and RNA purification  using the QiaGen RNeasy Mini kit combined with DNase treatment on a solid support (QiaGen #74104). RNA quality assessment and quantification was performed using microfluidic chip analysis on an Agilent 2100 bioanalyzer (Agilent Technologies).
For RNA-sequencing library preparation, 1000 ng total RNA was used as input. First, bacterial ribosomal RNA was depleted using the Ribo-Zero Magnetic Kit Bacteria (Illumina #MRZB12424). After depletion, RNA was resuspended in TruSeq Total RNA Sample Prep Kit Fragmentation buffer (8.5 ul RNA and 8.5 buffer) and reversed transcribed into cDNA using random hexamer primer. Then cDNA was further processed for the construction of sequencing libraries according to the manufacturer's recommendations using the TruSeq Stranded mRNA Sample Prep Kit (Illimina #RS-122-2101). Sequencing was performed with the Illumina TruSeq SBS Kit v4-HS chemistry (Illumina #FC-401-4003) on an Illumina HiSeq2500 instrument with 50 cycles of 2x50 bp paired-end sequencing.
      </Extract-Protocol>
    </Channel>
    <Data-Processing>
Illumina CASAVA v1.8.2  software used for basecalling and fastq file generation
Sequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to Escherichia coli str. K-12 substr. MG1655, complete genome (GenBank: U00096) genome using bowtie2
Reads Per Kilobase of exon per Megabase of library size (RPKM) were calculated using a protocol from Chepelev et al., Nucleic Acids Research, 2009. In short, exons from all isoforms of a gene were merged to create one meta-transcript. The number of reads falling in the exons of this meta-transcript were counted and normalized by the size of the meta-transcript and by the size of the library.
Genome_build: Escherichia coli str. K-12 substr. MG1655, complete genome (GenBank: U00096)
Supplementary_files_format_and_content: tab-delimited text files in GCT format include read counts of uniquely and fraction of multiple mapped reads (counts.gct.gz), and normalized counts RPKM (rpkms.gct.gz) values for each sample
    </Data-Processing>
    <Platform-Ref ref="GPL20227" />
    <Library-Strategy>RNA-Seq</Library-Strategy>
    <Library-Source>transcriptomic</Library-Source>
    <Library-Selection>cDNA</Library-Selection>
    <Instrument-Model>
      <Predefined>Illumina HiSeq 2500</Predefined>
    </Instrument-Model>
    <Contact-Ref ref="contrib1" />
    <Supplementary-Data type="unknown">
NONE
    </Supplementary-Data>
    <Relation type="BioSample" target="https://www.ncbi.nlm.nih.gov/biosample/SAMN08466802" />
    <Relation type="SRA" target="https://www.ncbi.nlm.nih.gov/sra?term=SRX3648429" />
  </Sample>

Here's the parser I'm working on, but it misses many of the fields.

import xml.etree.ElementTree as ET
from collections import defaultdict
import pandas as pd

def read_geo_xml(path, index_name=None):
    # Parse the XML tree
    tree = ET.parse(path)
    root = tree.getroot()
    # Extract the attributes
    data = defaultdict(dict)
    for record in root:
        id_record = record.attrib["iid"]
        for x in record.findall("*"):
            for y in x:
                for k,v in y.attrib.items():
                    data[id_record][(k,v)] = y.text.strip()

    # Create pd.DataFrame
    df = pd.DataFrame(data).T
    df.index.name = index_name
    return df

import requests
from io import StringIO

url = "https://pastebin.com/raw/AJp5pshP"
text = requests.get(url).text
xml_data = StringIO(text)
df = read_geo_xml(xml_data)
df.head()
#   taxid   tag
# 679895    strain  type    moa phenotype   treatment time  treatment concentration
# GSM2978339    Escherichia coli BW25113    BW25113 Gram-negative bacteria  cell wall synthesis inhibitor / lipoprotein EC90 of phenotype   ~ 25 min    200 uM
# GSM2978340    Escherichia coli BW25113    BW25113 Gram-negative bacteria  cell wall synthesis inhibitor / lipoprotein EC90 of phenotype   ~ 25 min    200 uM
# GSM2978341    Escherichia coli BW25113    BW25113 Gram-negative bacteria  cell wall synthesis inhibitor / lipoprotein EC90 of phenotype   ~ 25 min    200 uM
# GSM2978342    Escherichia coli BW25113    BW25113 Gram-negative bacteria  new hit EC90 of phenotype   ~ 25 min    50 uM
# GSM2978343    Escherichia coli BW25113    BW25113 Gram-negative bacteria  new hit EC90 of phenotype   ~ 25 min    50 uM

Expected output:

# Everything within a <field>  </field>
Submission-Date
Release-Date
Last-Update-Date
Title
Accession
Type
Channel-Count
Source
Organism
Treatment-Protocol
Growth-Protocol
Molecule
Data-Processing
Library-Strategy
Library-Source
Library-Selection
Instrument-Model
Supplementary-Data

# Everything under <Characteristics>
strain
type
moa
phenotype
treatment time
treatment concentration

I'm currently only able to pull the fields under "Characteristics".

Can you share your expected output? — sammywemmy, May 19 '20 at 02:54

score 0 · Answer 1 · edited Sep 02 '22 at 23:37

I'll use parsel to extract the data (including Title) with XPath:

from parsel import Selector
import pandas as pd

data = """[your data above]"""
selector = Selector(data)

Get the data for the characteristics nodes:

# All characteristics nodes have a `tag` attribute,
# which the other nodes lack, so I'll select on that.
tags = []
contents = []
for ent in selector.xpath(".//sample//*[@tag]"):
    contents.append(ent.xpath("./text()").get().strip())
    tags.append(ent.attrib.get('tag'))
xters = dict(zip(tags,contents))

Get the data from the other nodes, excluding characteristics:

elements = []
vals = []

#this searches through the nodes and excludes characteristics
for ent in selector.xpath(".//sample//*[not(self::characteristics)]"):
    #some nodes have no text, so we have to cater to that
    if not ent.xpath("./text()").get():
        continue
    elements.append(ent.xpath("name(.)").get())
    vals.append(ent.xpath("./text()").get().strip())

#create dictionary from the two lists
#and append the xters dict to form one main dict
results = dict(zip(elements,vals))
results.update(xters)


print(results)

{'status': '',
 'submission-date': '2018-02-05',
 'release-date': '2019-03-25',
 'last-update-date': '2019-03-25',
 'title': 'PDD_P2_70',
 'accession': 'GSM2978341',
 'type': 'Gram-negative bacteria',
 'channel-count': '1',
 'channel': '',
 'source': 'AZ-LolCDE',
 'organism': 'Escherichia coli BW25113',
 'treatment-protocol': 'bacteria were treated with different antibiotics for ~ 25 min till  ~OD 0.2  in 2 ml tubes',
 'growth-protocol': 'bacteria were grown in iso-sensitest medium',
 'molecule': 'total RNA',
 'extract-protocol': "after treament bacteria were resuspended in QiaGen RNAprotect Bacteria Reagent (QiaGen #76506), incubated for 5min, centrifuged, and flash frozen on dry ice. Total RNA was extracted by incubating bacteria in Enzymatic Lysis Buffer  (lysozyme & proteinase K) for 5 min followed by addition of QiaGen RLT Lysis Buffer and RNA purification  using the QiaGen RNeasy Mini kit combined with DNase treatment on a solid support (QiaGen #74104). RNA quality assessment and quantification was performed using microfluidic chip analysis on an Agilent 2100 bioanalyzer (Agilent Technologies).\nFor RNA-sequencing library preparation, 1000 ng total RNA was used as input. First, bacterial ribosomal RNA was depleted using the Ribo-Zero Magnetic Kit Bacteria (Illumina #MRZB12424). After depletion, RNA was resuspended in TruSeq Total RNA Sample Prep Kit Fragmentation buffer (8.5 ul RNA and 8.5 buffer) and reversed transcribed into cDNA using random hexamer primer. Then cDNA was further processed for the construction of sequencing libraries according to the manufacturer's recommendations using the TruSeq Stranded mRNA Sample Prep Kit (Illimina #RS-122-2101). Sequencing was performed with the Illumina TruSeq SBS Kit v4-HS chemistry (Illumina #FC-401-4003) on an Illumina HiSeq2500 instrument with 50 cycles of 2x50 bp paired-end sequencing.",
 'data-processing': 'Illumina CASAVA v1.8.2  software used for basecalling and fastq file generation\nSequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to Escherichia coli str. K-12 substr. MG1655, complete genome (GenBank: U00096) genome using bowtie2\nReads Per Kilobase of exon per Megabase of library size (RPKM) were calculated using a protocol from Chepelev et al., Nucleic Acids Research, 2009. In short, exons from all isoforms of a gene were merged to create one meta-transcript. The number of reads falling in the exons of this meta-transcript were counted and normalized by the size of the meta-transcript and by the size of the library.\nGenome_build: Escherichia coli str. K-12 substr. MG1655, complete genome (GenBank: U00096)\nSupplementary_files_format_and_content: tab-delimited text files in GCT format include read counts of uniquely and fraction of multiple mapped reads (counts.gct.gz), and normalized counts RPKM (rpkms.gct.gz) values for each sample',
 'library-strategy': 'RNA-Seq',
 'library-source': 'transcriptomic',
 'library-selection': 'cDNA',
 'instrument-model': '',
 'predefined': 'Illumina HiSeq 2500',
 'supplementary-data': 'NONE',
 'strain': 'BW25113',
 'moa': 'cell wall synthesis inhibitor / lipoprotein',
 'phenotype': 'EC90 of phenotype',
 'treatment time': '~ 25 min',
 'treatment concentration': '200 uM'}
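
Note that 'type' in the output above holds the Characteristics value rather than 'SRA' from the Type element: both end up under the same lowercase key, and results.update(xters) overwrites the earlier one. Below is a minimal sketch of one way to keep both values. The two small dictionaries are hypothetical stand-ins for `results` and `xters`, and the "characteristics:" prefix is an arbitrary choice for illustration, not something the source data defines.

```python
# Hypothetical stand-ins for the element dict and the characteristics dict.
elements = {"title": "PDD_P2_70", "type": "SRA"}
xters = {"strain": "BW25113", "type": "Gram-negative bacteria"}

# Plain update: the characteristic "type" replaces the element "type".
merged = dict(elements)
merged.update(xters)

# Prefixing the characteristic keys keeps both values side by side.
# The "characteristics:" prefix is an assumption made for this sketch.
safe = dict(elements)
safe.update({f"characteristics:{key}": value for key, value in xters.items()})

print(merged["type"])
print(safe["type"], "|", safe["characteristics:type"])
```

Any prefix or suffix that cannot clash with an element name would work equally well.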

You can read your data into a dataframe:

pd.DataFrame.from_dict(results,orient='index')
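
With a flat dictionary like `results`, the call above gives a single column with the field names as the index. If there are several samples, a nested dictionary keyed by sample id yields one row per sample instead. The following is a rough sketch under that assumption; the `records` dictionary is hypothetical and only reuses two fields from the question's printed output.

```python
import pandas as pd

# Hypothetical nested dict: one inner dict (shaped like `results`) per sample id.
records = {
    "GSM2978341": {
        "moa": "cell wall synthesis inhibitor / lipoprotein",
        "treatment concentration": "200 uM",
    },
    "GSM2978342": {
        "moa": "new hit",
        "treatment concentration": "50 uM",
    },
}

# orient="index" turns the outer keys into rows and the inner keys into columns.
df = pd.DataFrame.from_dict(records, orient="index")
df.index.name = "sample"
print(df)
```

Fields that are missing for a given sample would show up as NaN in that row.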

This would require me to know the "Title" was one of the fields. I want to grab everything under Sample. — O.rka, May 19 '20 at 16:23
That's vague. Some elements have attributes; do you want those as well? Kindly add your expected output to your question. — sammywemmy, May 19 '20 at 22:49
I've added to the question to show the expected fields and what I'm getting. — O.rka, May 20 '20 at 01:08

score 0 · Answer 2 · answered May 19 '20 at 14:28

An example.

from simplified_scrapy import SimplifiedDoc, utils

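def foo(ele, row):
    # Collect every attribute value, skipping the 'html' and 'tag' keys
    children = ele.children
    for a in ele:
        if a != 'html' and a != 'tag':
            row.append(ele[a])
    # Recurse into child elements; a leaf contributes its inner text
    if children:
        for child in children:
            foo(child, row)
    elif ele['html']:
        row.append(ele['html'])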

html = '''
<Sample iid="GSM2978341">
    <Status database="GEO">
      <Submission-Date>2018-02-05</Submission-Date>
      <Release-Date>2019-03-25</Release-Date>
      <Last-Update-Date>2019-03-25</Last-Update-Date>
    </Status>
    <Title>PDD_P2_70</Title>
    <Accession database="GEO">GSM2978341</Accession>
    <Type>SRA</Type>
</Sample>
'''
doc = SimplifiedDoc(html)
row = []
foo(doc,row)
print(row)

Result:

['GSM2978341', 'GEO', '2018-02-05', '2019-03-25', '2019-03-25', 'PDD_P2_70', 'GEO', 'GSM2978341', 'SRA']
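
A similar walk over all descendants can be sketched with the standard library's xml.etree.ElementTree, which is what the question asks about. This is only a sketch under stated assumptions: the helper name `flatten_sample` and the `<Root>` wrapper are hypothetical, the XML is a shortened copy of the sample above, and no XML namespace is assumed (a namespaced file would need the namespace stripped or included in the element names).

```python
import xml.etree.ElementTree as ET
import pandas as pd

def flatten_sample(sample):
    """Collect the text of every descendant element into one flat dict."""
    row = {}
    for el in sample.iter():
        if el is sample:
            continue
        text = (el.text or "").strip()
        if not text:
            continue  # containers such as <Status> hold only whitespace
        # Characteristics are keyed by their tag attribute, others by element name.
        key = el.get("tag", el.tag) if el.tag == "Characteristics" else el.tag
        row[key] = text
    return row

# Shortened, hypothetical input wrapped in a <Root> element.
xml_text = """<Root>
  <Sample iid="GSM2978341">
    <Status database="GEO">
      <Submission-Date>2018-02-05</Submission-Date>
    </Status>
    <Title>PDD_P2_70</Title>
    <Type>SRA</Type>
    <Channel position="1">
      <Characteristics tag="type">
Gram-negative bacteria
      </Characteristics>
    </Channel>
  </Sample>
</Root>"""

root = ET.fromstring(xml_text)
rows = {s.get("iid"): flatten_sample(s) for s in root.iter("Sample")}
df = pd.DataFrame.from_dict(rows, orient="index")
print(df)
```

ElementTree keeps element names case-sensitive, so the Type element and the "type" characteristic land in separate columns here. Elements that occur more than once under a sample with the same name would still overwrite each other in this sketch.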
