This XML sample represents a single metabolite from the HMDB Serum Metabolites
dataset:
<?xml version="1.0" encoding="UTF-8"?>
<hmdb xmlns="http://www.hmdb.ca">
<metabolite>
<version>4.0</version>
<creation_date>2005-11-16 15:48:42 UTC</creation_date>
<update_date>2019-01-11 19:13:56 UTC</update_date>
<accession>HMDB0000001</accession>
<status>quantified</status>
<secondary_accessions>
<accession>HMDB00001</accession>
<accession>HMDB0004935</accession>
<accession>HMDB0006703</accession>
<accession>HMDB0006704</accession>
<accession>HMDB04935</accession>
<accession>HMDB06703</accession>
<accession>HMDB06704</accession>
</secondary_accessions>
<name>1-Methylhistidine</name>
<cs_description>1-Methylhistidine, also known as 1-mhis, belongs to the class of organic compounds known as histidine and derivatives. Histidine and derivatives are compounds containing cysteine or a derivative thereof resulting from reaction of cysteine at the amino group or the carboxy group, or from the replacement of any hydrogen of glycine by a heteroatom. 1-Methylhistidine has been found in human muscle and skeletal muscle tissues, and has also been detected in most biofluids, including cerebrospinal fluid, saliva, blood, and feces. Within the cell, 1-methylhistidine is primarily located in the cytoplasm. 1-Methylhistidine participates in a number of enzymatic reactions. In particular, 1-Methylhistidine and Beta-alanine can be converted into anserine; which is catalyzed by the enzyme carnosine synthase 1. In addition, Beta-Alanine and 1-methylhistidine can be biosynthesized from anserine; which is mediated by the enzyme cytosolic non-specific dipeptidase. In humans, 1-methylhistidine is involved in the histidine metabolism pathway. 1-Methylhistidine is also involved in the metabolic disorder called the histidinemia pathway.</cs_description>
<description>One-methylhistidine (1-MHis) is derived mainly from the anserine of dietary flesh sources, especially poultry. The enzyme, carnosinase, splits anserine into b-alanine and 1-MHis. High levels of 1-MHis tend to inhibit the enzyme carnosinase and increase anserine levels. Conversely, genetic variants with deficient carnosinase activity in plasma show increased 1-MHis excretions when they consume a high meat diet. Reduced serum carnosinase activity is also found in patients with Parkinson's disease and multiple sclerosis and patients following a cerebrovascular accident. Vitamin E deficiency can lead to 1-methylhistidinuria from increased oxidative effects in skeletal muscle. 1-Methylhistidine is a biomarker for the consumption of meat, especially red meat.</description>
<synonyms>
<synonym>(2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoic acid</synonym>
<synonym>1-Methylhistidine</synonym>
<synonym>Pi-methylhistidine</synonym>
<synonym>(2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoate</synonym>
<synonym>1 Methylhistidine</synonym>
<synonym>1-Methyl histidine</synonym>
</synonyms>
<chemical_formula>C7H11N3O2</chemical_formula>
<smiles>CN1C=NC(C[C@H](N)C(O)=O)=C1</smiles>
<inchikey>BRMWTNUJHUMWMS-LURJTMIESA-N</inchikey>
<diseases>
<disease>
<name>Kidney disease</name>
<omim_id/>
<references>
<reference>
<reference_text>McGregor DO, Dellow WJ, Lever M, George PM, Robson RA, Chambers ST: Dimethylglycine accumulates in uremia and predicts elevated plasma homocysteine concentrations. Kidney Int. 2001 Jun;59(6):2267-72.</reference_text>
<pubmed_id>11380830</pubmed_id>
</reference>
<reference>
<reference_text>Ehrenpreis ED, Salvino M, Craig RM: Improving the serum D-xylose test for the identification of patients with small intestinal malabsorption. J Clin Gastroenterol. 2001 Jul;33(1):36-40.</reference_text>
<pubmed_id>11418788</pubmed_id>
</reference>
</references>
</disease>
</diseases>
What I'm trying to do is run nested loops and create a list of dictionaries.
Each dictionary will represent one metabolite.
Each key in a dictionary will be a selected node (chosen by tag name).
The value for each key will be either a list of strings or a single string.
This is the structure I think is needed (better ideas are also welcome):
[
    {
        "accession": "accession.value",
        "name": "name.value",
        "synonyms": [synonyms.value.1, synonyms.value.2, synonyms.value.3, ...],
        "chemical_formula": "chemical_formula.value",
        "smiles": "smiles.value",
        "inchikey": "inchikey.value",
        "biological_properties_pathways": [pathways.value1, pathways.value2, pathways.value3, ...],
        "diseases": [disease.name.1, disease.name.2, disease.name.3, ...],
        "pubmed_id's for disease.name.1": [pubmed_id.value.1, pubmed_id.value.2, pubmed_id.value.3, ...],
        "pubmed_id's for disease.name.2": [pubmed_id.value.1, pubmed_id.value.2, pubmed_id.value.3, ...],
        .
        .
        .
    },
    {
        "accession": "accession.value",
        "name": "name.value",
        "synonyms": [synonyms.value.1, synonyms.value.2, synonyms.value.3, ...],
        "chemical_formula": "chemical_formula.value",
        "smiles": "smiles.value",
        "inchikey": "inchikey.value",
        "biological_properties_pathways": [pathways.value1, pathways.value2, pathways.value3, ...],
        "diseases": [disease.name.1, disease.name.2, disease.name.3, ...],
        "pubmed_id's for disease.name.1": [pubmed_id.value.1, pubmed_id.value.2, pubmed_id.value.3, ...],
        "pubmed_id's for disease.name.2": [pubmed_id.value.1, pubmed_id.value.2, pubmed_id.value.3, ...],
        .
        .
        .
    },
    .
    .
    .
]
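One possible alternative layout, sketched under assumptions: instead of creating a separate "pubmed_id's for ..." key per disease, each disease could be stored as its own small dictionary holding its name and its PubMed IDs. The helper name `disease_records` and the `NS` and `SAMPLE` constants are hypothetical, and the inline XML is a shortened copy of the sample above, not the full HMDB record.

```python
import xml.etree.ElementTree as ET

# Prefix map for the HMDB default namespace (the prefix "h" is arbitrary).
NS = {"h": "http://www.hmdb.ca"}

# Shortened copy of the sample metabolite, for illustration only.
SAMPLE = """<hmdb xmlns="http://www.hmdb.ca">
  <metabolite>
    <accession>HMDB0000001</accession>
    <diseases>
      <disease>
        <name>Kidney disease</name>
        <references>
          <reference><pubmed_id>11380830</pubmed_id></reference>
          <reference><pubmed_id>11418788</pubmed_id></reference>
        </references>
      </disease>
    </diseases>
  </metabolite>
</hmdb>"""


def disease_records(metabolite):
    """Return one {"name", "pubmed_ids"} dict per <disease> element."""
    records = []
    for disease in metabolite.findall("h:diseases/h:disease", NS):
        records.append({
            "name": disease.findtext("h:name", namespaces=NS),
            "pubmed_ids": [
                ref.text
                for ref in disease.findall(
                    "h:references/h:reference/h:pubmed_id", NS
                )
            ],
        })
    return records


root = ET.fromstring(SAMPLE)
metabolite = root.find("h:metabolite", NS)
diseases = disease_records(metabolite)
print(diseases)
```

With this layout the set of keys stays the same for every metabolite, which may make the list easier to convert to JSON or a table later.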
This is what I have done so far:
# Import packages
from xml.dom import minidom
import xml.etree.ElementTree as et
# Load data
data1 = et.parse('D:/path/to/my/Projects/HMDB/DataSets/saliva_metabolites/saliva_metabolites.xml')
# Get the root element (<hmdb>) of the parsed tree
root = data1.getroot()
# Create the namespace prefix map
ns = {"h": "http://www.hmdb.ca"}
# Extract only the first 3 metabolites to keep things manageable
metabolites = root.findall('./h:metabolite', ns)[0:3]
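To make the namespace handling concrete, here is a minimal sketch that parses a small inline string instead of the real file (the `xml_text` content is an assumption, trimmed to two metabolites with two fields each). It shows that ElementTree stores each tag with the namespace URI in braces, and that `findtext` with the prefix map can look up a child without comparing against that long tag string.

```python
import xml.etree.ElementTree as ET

ns = {"h": "http://www.hmdb.ca"}

# Tiny stand-in for the real dataset file (illustrative only).
xml_text = """<hmdb xmlns="http://www.hmdb.ca">
  <metabolite><accession>HMDB0000001</accession><name>1-Methylhistidine</name></metabolite>
  <metabolite><accession>HMDB0000005</accession><name>2-Ketobutyric acid</name></metabolite>
</hmdb>"""

root = ET.fromstring(xml_text)
metabolites = root.findall("./h:metabolite", ns)

# Tags carry the namespace URI in braces, e.g. '{http://www.hmdb.ca}name'.
first_tags = [child.tag for child in metabolites[0]]
print(first_tags)

# The prefix map lets a child be looked up directly by its short name.
accessions = [m.findtext("h:accession", namespaces=ns) for m in metabolites]
print(accessions)
```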
Next, I run the nested loop on the 3 metabolites and select specific nodes (the first 2 I need) as dictionary entries:
newlist = []
for child in metabolites:
    innerlist = []
    dicts = {}
    for subchild in child:
        if subchild.tag == '{http://www.hmdb.ca}accession':
            dicts = {"accession": subchild.text}
        if subchild.tag == '{http://www.hmdb.ca}name':
            dicts = {"name": subchild.text}
        innerlist.append(subchild.text)
        print(innerlist)
    newlist.append(dicts)
I received this output:
>> print(newlist)
[{'name': '1-Methylhistidine'}, {'name': '2-Ketobutyric acid'}, {'name': '2-Hydroxybutyric acid'}]
instead of
[{'accession': 'HMDB0000001','name': '1-Methylhistidine' },
{'accession': 'HMDB0000005', 'name': '2-Ketobutyric acid'},
{'accession': 'HMDB0000008', 'name': '2-Hydroxybutyric acid'}]
meaning that the <name> entry overwrites the <accession> entry.
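The overwrite can be reproduced without any XML. In this minimal sketch (an illustration, not the original loop; the names `pairs`, `rebound`, and `updated` are made up for it), `rebound = {...}` binds the name to a brand-new one-key dictionary on every pass, while `updated[key] = value` adds a key to the dictionary that already exists.

```python
pairs = [("accession", "HMDB0000001"), ("name", "1-Methylhistidine")]

# Rebinding: each assignment throws away the previous dictionary.
rebound = {}
for key, value in pairs:
    rebound = {key: value}

# Item assignment: each pass adds a key to the same dictionary.
updated = {}
for key, value in pairs:
    updated[key] = value

print(rebound)
print(updated)
```

Only the second form keeps both the accession and the name.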
I also tried to store a list as the value for a key:
newlist = []
for child in metabolites:
    innerlist = []
    dicts = {}
    for subchild in child:
        # if subchild.tag=='{http://www.hmdb.ca}accession':
        #     dicts={"accession": subchild.text}
        # if subchild.tag == '{http://www.hmdb.ca}name':
        #     dicts = {"name": subchild.text}
        if subchild.tag == '{http://www.hmdb.ca}synonyms':
            for synonym in subchild:
                dicts = {"synonyms": synonym.text}
                print(synonym.text)
                innerlist.append(subchild.text)
                print(innerlist)
    newlist.append(dicts)
    innerlist.append(subchild.text)
    newlist.append(innerlist)
And again the values in the output are overwritten:
>> print(newlist)
[{'synonyms': '1-Methylhistidine dihydrochloride'},
{'synonyms': 'alpha-Ketobutyric acid, sodium salt'},
{'synonyms': '2-Hydroxybutyric acid, monosodium salt, (+-)-isomer'}]
Each of the 3 dictionaries above contains only the last value from its synonym list, instead of the full list of values.
I expected something like this (but with all values per synonym):
>> print(newlist)
[{'synonyms': ['(2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoic acid',
'1-Methylhistidine',
....
'1-Methylhistidine dihydrochloride' ]},
{'synonyms': ['2-Ketobutanoic acid',
'2-Oxobutyric acid',
....
'alpha-Ketobutyric acid, sodium salt']},
{'synonyms': [ '2-Hydroxybutanoic acid',
'alpha-Hydroxybutanoic acid',
....
'2-Hydroxybutyric acid, monosodium salt, (+-)-isomer']}
]
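A minimal sketch of one way to get a list per key, assuming the same namespace map and a trimmed inline copy of the sample (`xml_text` and `records` are illustrative names): the synonym texts are collected with a list comprehension, and the dictionary for each metabolite is built once rather than reassigned inside the loop.

```python
import xml.etree.ElementTree as ET

ns = {"h": "http://www.hmdb.ca"}

# Trimmed copy of the sample metabolite (illustrative only).
xml_text = """<hmdb xmlns="http://www.hmdb.ca">
  <metabolite>
    <accession>HMDB0000001</accession>
    <name>1-Methylhistidine</name>
    <synonyms>
      <synonym>Pi-methylhistidine</synonym>
      <synonym>1 Methylhistidine</synonym>
      <synonym>1-Methyl histidine</synonym>
    </synonyms>
  </metabolite>
</hmdb>"""

root = ET.fromstring(xml_text)

records = []
for metabolite in root.findall("h:metabolite", ns):
    records.append({
        "accession": metabolite.findtext("h:accession", namespaces=ns),
        "name": metabolite.findtext("h:name", namespaces=ns),
        # Gather every <synonym> text into one list instead of
        # replacing the value on each iteration.
        "synonyms": [
            s.text for s in metabolite.findall("h:synonyms/h:synonym", ns)
        ],
    })

print(records)
```

The same pattern should extend to the other single-value tags (`chemical_formula`, `smiles`, `inchikey`) by adding one `findtext` call per key.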
I used these questions while writing the loop:
- Create List of Dictionary Python - I think is very similar but can't make it work
- How to create and fill a list of lists in a for loop
- Python ElementTree - iterate through child nodes and text in order
- Populating a dictionary using for loops (python) [duplicate]
- Generating nested lists from XML doc
Any thoughts, hints, clues, or ideas would be greatly appreciated.