This is my first time parsing an XML file, and I'm following both this pandas explanation and this SO question. I have an XML file from PubMed (any should work, but I downloaded the first one: pubmed22n1115.xml). The file is deeply nested and much more complex than the examples in the SO/pandas explanations, and I can't work out how to parse it.
What I tried is:
import pandas as pd
df = pd.read_xml('../../Downloads/pubmed22n1115.xml')
df.head()
>>>
MedlineCitation PubmedData PMID
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
All the other examples I looked at for parsing XML files were written for one specific file structure, and I can't work out how to adapt them to this file.
The only two things I need from this file are PMID and AbstractText. The expected output is a pandas DataFrame that looks like:
PMID AbstractText
0 1212 text1
1 1233 text2
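
For reference, here is a minimal sketch of one way to get that shape using only the standard library's xml.etree.ElementTree. It is not a confirmed solution for this exact file. It assumes each record is a PubmedArticle element with the PMID at MedlineCitation/PMID and the abstract at MedlineCitation/Article/Abstract/AbstractText. The inline SAMPLE string and the extract_rows helper are hypothetical stand-ins for illustration. The NaN columns above are probably there because pd.read_xml is documented as handling only shallow, flat XML, so nested elements such as MedlineCitation come back empty.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for the real download. The element layout is an assumption
# about the PubMed structure, not copied from pubmed22n1115.xml.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">1212</PMID>
      <Article>
        <Abstract>
          <AbstractText>text1</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">1233</PMID>
      <Article>
        <Abstract>
          <AbstractText>text2</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def extract_rows(source):
    """Collect one {'PMID', 'AbstractText'} dict per PubmedArticle element.

    `source` is a file path or a binary file object. iterparse streams the
    document, so a large file is not built into one tree before processing.
    """
    rows = []
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "PubmedArticle":
            continue
        pmid = elem.findtext("MedlineCitation/PMID")
        # An abstract may be split into several AbstractText sections;
        # join them, keeping text nested inside inline markup as well.
        parts = [
            "".join(node.itertext()).strip()
            for node in elem.iterfind("MedlineCitation/Article/Abstract/AbstractText")
        ]
        rows.append({"PMID": pmid, "AbstractText": " ".join(parts) or None})
        elem.clear()  # release this article's children once it is processed
    return rows


rows = extract_rows(io.BytesIO(SAMPLE.encode("utf-8")))
print(rows)
```

Passing the resulting list to pd.DataFrame(rows) should give the two-column frame shown above. For the real file, the path would go to extract_rows in place of the BytesIO object. Articles without an abstract get None in the AbstractText column.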