Pandas "read_xml" returns NaNs

Question

This is my first time parsing an XML file, and I'm following both a pandas explanation and an SO question. I have an XML file from PubMed (any should work, but I downloaded the first one: pubmed22n1115.xml). The file is much more deeply nested than the examples in either source, and I can't work out how to parse it.

What I tried is:

import pandas as pd
df = pd.read_xml('../../Downloads/pubmed22n1115.xml')
df.head()
>>>
   MedlineCitation  PubmedData  PMID
0              NaN         NaN   NaN
1              NaN         NaN   NaN
2              NaN         NaN   NaN
3              NaN         NaN   NaN

All the other examples I looked at were tied to the structure of one specific XML file, and I couldn't adapt them to this one. The only two things I need from this file are PMID and AbstractText. The expected output is a pandas dataframe that looks like

   PMID  AbstractText
0  1212  text1
1  1233  text2

It's pretty big so I'm not sure I can, sorry. Is that critical? — Penguin, Aug 02 '22 at 20:20
Nvm, I see you posted the link, downloaded a file from there, trying some stuff on it, will update. — Barry the Platipus, Aug 02 '22 at 20:22

Barry the Platipus · Accepted Answer · 2022-08-02T22:20:43.877

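By default, read_xml uses xpath="./*", which selects the root's immediate children (here the PubmedArticle nodes) as rows and reads only their direct children as columns. Those children (MedlineCitation, PubmedData) contain nested elements rather than text, so every value comes back as NaN. To reach the data you need, pass an xpath that drills down to the relevant element, like so (this is on a random XML doc downloaded from that link):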

import pandas as pd

df = pd.read_xml('pubmed22n1123.xml/pubmed22n1123.xml', xpath=".//PMID")
print(df)

This will print out in terminal:

       Version      PMID
0            1  14584002
1            1  16916636
2            1  34919821
3            1  17541330
4            1  17643379
...        ...       ...
18359        1  34919510
18360        1  34919742
18361        1  34919747
18362        1  34919751
18363        1  34919752

The following pandas documentation might be helpful:

https://pandas.pydata.org/docs/dev/reference/api/pandas.read_xml.html

EDIT: You can get AbstractText with:

df = pd.read_xml('pubmed22n1123.xml/pubmed22n1123.xml', xpath=".//AbstractText")
print(df)

Resulting in:

     Label                         NlmCategory  AbstractText                                       sup   i     sub   b     u     {http://www.w3.org/1998/Math/MathML}math
0    BACKGROUND                    BACKGROUND   Kawasaki disease is the most common cause of a...  None  None  None  None  None  NaN
1    OBJECTIVES                    OBJECTIVE    The objective of this review was to evaluate t...  None  None  None  None  None  NaN
2    SEARCH STRATEGY               METHODS      Electronic searches of the Cochrane Peripheral...  None  None  None  None  None  NaN
3    SELECTION CRITERIA            METHODS      Randomised controlled trials of intravenous im...  None  None  None  None  None  NaN
4    DATA COLLECTION AND ANALYSIS  METHODS      Fifty-nine trials were identified in the initi...  None  None  None  None  None  NaN
...  ...                           ...          ...                                                ...   ...   ...   ...   ...   ...
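
The two calls above return PMIDs and abstract sections as separate frames with different row counts, so they cannot simply be joined side by side. If you need one row per article, a rough sketch with the standard library's xml.etree.ElementTree might look like the following. It assumes the usual PubMed layout (PubmedArticle > MedlineCitation > PMID, and MedlineCitation > Article > Abstract > AbstractText); the function name extract_abstracts and the tiny SAMPLE document are made up for illustration and are not taken from the real file.

```python
import io
import xml.etree.ElementTree as ET


def extract_abstracts(source):
    """Return one {"PMID", "AbstractText"} dict per PubmedArticle that has an abstract.

    `source` is a file path or a binary file-like object.
    Hypothetical helper; the element paths assume the usual PubMed layout.
    """
    records = []
    # iterparse streams the file, so a large download is processed article by article.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "PubmedArticle":
            continue
        # The article's own PMID is a direct child of MedlineCitation;
        # ".//PMID" would also match PMIDs that appear deeper in the record.
        pmid = elem.findtext("MedlineCitation/PMID")
        # An abstract may be split into several labelled AbstractText sections.
        # itertext() also picks up text inside inline markup such as <i> or <sup>.
        sections = [
            "".join(node.itertext()).strip()
            for node in elem.findall("MedlineCitation/Article/Abstract/AbstractText")
        ]
        if pmid and sections:
            records.append({"PMID": int(pmid), "AbstractText": " ".join(sections)})
        elem.clear()  # drop the finished article's subtree to keep memory low
    return records


# Made-up sample document, only to show the shape of the result.
SAMPLE = b"""<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">1212</PMID>
      <Article>
        <Abstract>
          <AbstractText Label="BACKGROUND">First <i>part</i>.</AbstractText>
          <AbstractText Label="METHODS">Second part.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">1233</PMID>
      <Article/>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

records = extract_abstracts(io.BytesIO(SAMPLE))
print(records)
```

Passing the result to pd.DataFrame(records) should give a frame with PMID and AbstractText columns. In this sketch, articles without an abstract are skipped and labelled sections are joined with a space; both are choices you may want to change.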

This seems to mostly work. I'm trying something similar with the `AbstractText` (`pd.read_xml('../../Downloads/pubmed22n1115.xml', xpath=".//Abstract")`) but running into an issue. Mainly, the abstract that I need is under `` and then between 2 `` . However, there's also some `AbstractText` that looks like ` — Penguin, Aug 02 '22 at 20:55
You can filter out the 'objective' texts in dataframe... I'm not sure I understand you. — Barry the Platipus, Aug 02 '22 at 21:20
If you somehow hit a wall with pandas read_xml and xpath (though you should be alright with it), you can always try xml.etree.ElementTree, see this article from Medium: https://florian-kromer.medium.com/parsing-xml-into-pandas-dataframes-661882abd8e5 — Barry the Platipus, Aug 02 '22 at 21:30
