I'm trying to parse an XML file from the Gene Expression Omnibus (GEO). I figured out how to get some of the data fields, but I can't figure out how to get info like <Title>.
I tried adapting the approach from "How to convert an XML file to nice pandas dataframe?" but was only able to get some of the information.
How can I extract all of the available data into a pandas dataframe?
Here's an example of the XML file:
<Sample iid="GSM2978341">
<Status database="GEO">
<Submission-Date>2018-02-05</Submission-Date>
<Release-Date>2019-03-25</Release-Date>
<Last-Update-Date>2019-03-25</Last-Update-Date>
</Status>
<Title>PDD_P2_70</Title>
<Accession database="GEO">GSM2978341</Accession>
<Type>SRA</Type>
<Channel-Count>1</Channel-Count>
<Channel position="1">
<Source>AZ-LolCDE</Source>
<Organism taxid="679895">Escherichia coli BW25113</Organism>
<Characteristics tag="strain">
BW25113
</Characteristics>
<Characteristics tag="type">
Gram-negative bacteria
</Characteristics>
<Characteristics tag="moa">
cell wall synthesis inhibitor / lipoprotein
</Characteristics>
<Characteristics tag="phenotype">
EC90 of phenotype
</Characteristics>
<Characteristics tag="treatment time">
~ 25 min
</Characteristics>
<Characteristics tag="treatment concentration">
200 uM
</Characteristics>
<Treatment-Protocol>
bacteria were treated with different antibiotics for ~ 25 min till ~OD 0.2 in 2 ml tubes
</Treatment-Protocol>
<Growth-Protocol>
bacteria were grown in iso-sensitest medium
</Growth-Protocol>
<Molecule>total RNA</Molecule>
<Extract-Protocol>
after treament bacteria were resuspended in QiaGen RNAprotect Bacteria Reagent (QiaGen #76506), incubated for 5min, centrifuged, and flash frozen on dry ice. Total RNA was extracted by incubating bacteria in Enzymatic Lysis Buffer (lysozyme & proteinase K) for 5 min followed by addition of QiaGen RLT Lysis Buffer and RNA purification using the QiaGen RNeasy Mini kit combined with DNase treatment on a solid support (QiaGen #74104). RNA quality assessment and quantification was performed using microfluidic chip analysis on an Agilent 2100 bioanalyzer (Agilent Technologies).
For RNA-sequencing library preparation, 1000 ng total RNA was used as input. First, bacterial ribosomal RNA was depleted using the Ribo-Zero Magnetic Kit Bacteria (Illumina #MRZB12424). After depletion, RNA was resuspended in TruSeq Total RNA Sample Prep Kit Fragmentation buffer (8.5 ul RNA and 8.5 buffer) and reversed transcribed into cDNA using random hexamer primer. Then cDNA was further processed for the construction of sequencing libraries according to the manufacturer's recommendations using the TruSeq Stranded mRNA Sample Prep Kit (Illimina #RS-122-2101). Sequencing was performed with the Illumina TruSeq SBS Kit v4-HS chemistry (Illumina #FC-401-4003) on an Illumina HiSeq2500 instrument with 50 cycles of 2x50 bp paired-end sequencing.
</Extract-Protocol>
</Channel>
<Data-Processing>
Illumina CASAVA v1.8.2 software used for basecalling and fastq file generation
Sequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to Escherichia coli str. K-12 substr. MG1655, complete genome (GenBank: U00096) genome using bowtie2
Reads Per Kilobase of exon per Megabase of library size (RPKM) were calculated using a protocol from Chepelev et al., Nucleic Acids Research, 2009. In short, exons from all isoforms of a gene were merged to create one meta-transcript. The number of reads falling in the exons of this meta-transcript were counted and normalized by the size of the meta-transcript and by the size of the library.
Genome_build: Escherichia coli str. K-12 substr. MG1655, complete genome (GenBank: U00096)
Supplementary_files_format_and_content: tab-delimited text files in GCT format include read counts of uniquely and fraction of multiple mapped reads (counts.gct.gz), and normalized counts RPKM (rpkms.gct.gz) values for each sample
</Data-Processing>
<Platform-Ref ref="GPL20227" />
<Library-Strategy>RNA-Seq</Library-Strategy>
<Library-Source>transcriptomic</Library-Source>
<Library-Selection>cDNA</Library-Selection>
<Instrument-Model>
<Predefined>Illumina HiSeq 2500</Predefined>
</Instrument-Model>
<Contact-Ref ref="contrib1" />
<Supplementary-Data type="unknown">
NONE
</Supplementary-Data>
<Relation type="BioSample" target="https://www.ncbi.nlm.nih.gov/biosample/SAMN08466802" />
<Relation type="SRA" target="https://www.ncbi.nlm.nih.gov/sra?term=SRX3648429" />
</Sample>
Here's the parser I'm working on, but it misses many of the fields.
import xml.etree.ElementTree as ET
from collections import defaultdict
from io import StringIO

import pandas as pd
import requests

def read_geo_xml(path, index_name=None):
    # Parse the XML tree
    tree = ET.parse(path)
    root = tree.getroot()
    # Extract the attributes
    data = defaultdict(dict)
    for record in root:
        id_record = record.attrib["iid"]
        # Only grandchildren of each sample that carry attributes are stored
        for x in record.findall("*"):
            for y in x:
                for k, v in y.attrib.items():
                    data[id_record][(k, v)] = y.text.strip()
    # Create pd.DataFrame
    df = pd.DataFrame(data).T
    df.index.name = index_name
    return df

# Download the example XML and parse it
url = "https://pastebin.com/raw/AJp5pshP"
text = requests.get(url).text
xml_data = StringIO(text)
df = read_geo_xml(xml_data)
df.head()
# taxid tag
# 679895 strain type moa phenotype treatment time treatment concentration
# GSM2978339 Escherichia coli BW25113 BW25113 Gram-negative bacteria cell wall synthesis inhibitor / lipoprotein EC90 of phenotype ~ 25 min 200 uM
# GSM2978340 Escherichia coli BW25113 BW25113 Gram-negative bacteria cell wall synthesis inhibitor / lipoprotein EC90 of phenotype ~ 25 min 200 uM
# GSM2978341 Escherichia coli BW25113 BW25113 Gram-negative bacteria cell wall synthesis inhibitor / lipoprotein EC90 of phenotype ~ 25 min 200 uM
# GSM2978342 Escherichia coli BW25113 BW25113 Gram-negative bacteria new hit EC90 of phenotype ~ 25 min 50 uM
# GSM2978343 Escherichia coli BW25113 BW25113 Gram-negative bacteria new hit EC90 of phenotype ~ 25 min 50 uM
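One way to see what those nested loops skip is to list every element that carries text, together with its depth below the sample and its attribute count. This is only a minimal diagnostic sketch; the helper name `text_fields` is hypothetical and not part of the parser above.

```python
import xml.etree.ElementTree as ET


def text_fields(sample):
    """Return (depth, tag, n_attributes) for each descendant that has text."""
    found = []

    def walk(elem, depth):
        for child in elem:
            # Record the element only if it has non-whitespace text of its own.
            if (child.text or "").strip():
                found.append((depth, child.tag, len(child.attrib)))
            walk(child, depth + 1)

    walk(sample, 1)
    return found
```

The loops in `read_geo_xml` only store a value for elements at depth 2 that have at least one attribute, which is why <Organism> and <Characteristics> show up but <Title> (depth 1, no attributes) does not.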
Expected output:
# Everything within a <field> </field>
Submission-Date
Release-Date
Last-Update-Date
Title
Accession
Type
Channel-Count
Source
Organism
Treatment-Protocol
Growth-Protocol
Molecule
Data-Processing
Library-Strategy
Library-Source
Library-Selection
Instrument-Model
Supplementary-Data
# Everything under <Characteristics>
strain
type
moa
phenotype
treatment time
treatment concentration
I'm currently only able to pull the values from the "Characteristics" elements.
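To make the expected output concrete, here is a rough sketch of the shape I'm after: walk every descendant of each <Sample>, keep the ones that have text, and key each <Characteristics> by its `tag` attribute. The helper names (`local_name`, `flatten_sample`, `samples_to_rows`) are hypothetical, and the sketch assumes field names are unique within a sample and that the samples are direct children of the root, as in my loop above.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop a '{namespace}' prefix from an element tag, if one is present."""
    return tag.rsplit("}", 1)[-1]


def flatten_sample(sample):
    """Map field name -> stripped text for every text-bearing descendant."""
    row = {}
    for elem in sample.iter():
        if elem is sample:
            continue
        text = (elem.text or "").strip()
        if not text:
            # Containers (<Status>, <Channel>) and empty elements
            # (<Platform-Ref ref="..."/>) have no text of their own.
            continue
        name = local_name(elem.tag)
        if name == "Characteristics":
            # Use the tag attribute ("strain", "moa", ...) as the column name.
            name = elem.get("tag", name)
        # Assumption: names are unique within a sample; a repeated name
        # (e.g. a second <Channel>) would overwrite the earlier value.
        row[name] = text
    return row


def samples_to_rows(root):
    """Return {iid: flattened fields} for each <Sample> child of root."""
    return {
        child.get("iid"): flatten_sample(child)
        for child in root
        if local_name(child.tag) == "Sample"
    }
```

With `root = ET.parse(xml_data).getroot()`, passing the result to `pd.DataFrame.from_dict(samples_to_rows(root), orient="index")` should give one row per sample. In this sketch the instrument ends up under "Predefined" rather than "Instrument-Model", because the text lives in the nested <Predefined> element.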