This is my first time parsing an XML file, and I'm following both this pandas explanation and this SO question. I have an XML file from PubMed (any should work, but I downloaded the first one: pubmed22n1115.xml). The file is much more complex than the ones in the SO/pandas explanations, and I can't figure out how to parse it.

What I tried is:

import pandas as pd
df = pd.read_xml('../../Downloads/pubmed22n1115.xml')
df.head()
>>>
   MedlineCitation  PubmedData  PMID
0              NaN         NaN   NaN
1              NaN         NaN   NaN
2              NaN         NaN   NaN
3              NaN         NaN   NaN

All the other examples I looked at for parsing XML files were very specific to the structure of their particular file, and I can't follow them. The only two things I need from this file are PMID and AbstractText. The expected output is a pandas DataFrame that looks like:

    PMID    AbstractText
0   1212    text1   
1   1233    text2   
Penguin

1 Answer

You need to drill down into that huge XML file to get at the relevant data. In pandas you do this with the xpath argument of read_xml, like so (this is on a random XML doc downloaded from that link):

import pandas as pd

df = pd.read_xml('pubmed22n1123.xml/pubmed22n1123.xml', xpath=".//PMID")
print(df)

This will print out in terminal:

       Version      PMID
0            1  14584002
1            1  16916636
2            1  34919821
3            1  17541330
4            1  17643379
...        ...       ...
18359        1  34919510
18360        1  34919742
18361        1  34919747
18362        1  34919751
18363        1  34919752

The following pandas documentation might be helpful:

https://pandas.pydata.org/docs/dev/reference/api/pandas.read_xml.html

EDIT: You can get AbstractText with:

df = pd.read_xml('pubmed22n1123.xml/pubmed22n1123.xml', xpath=".//AbstractText")
print(df)

Resulting in:

     Label                         NlmCategory  AbstractText                                        sup   i     sub   b     u     {http://www.w3.org/1998/Math/MathML}math
0    BACKGROUND                    BACKGROUND   Kawasaki disease is the most common cause of a...   None  None  None  None  None  NaN
1    OBJECTIVES                    OBJECTIVE    The objective of this review was to evaluate t...   None  None  None  None  None  NaN
2    SEARCH STRATEGY               METHODS      Electronic searches of the Cochrane Peripheral...   None  None  None  None  None  NaN
3    SELECTION CRITERIA            METHODS      Randomised controlled trials of intravenous im...   None  None  None  None  None  NaN
4    DATA COLLECTION AND ANALYSIS  METHODS      Fifty-nine trials were identified in the initi...   None  None  None  None  None  NaN
...  ...                           ...          ...                                                 ...   ...   ...   ...   ...   ...
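
The two calls above give PMIDs and abstract sections as separate frames with no shared key, so they can't be joined directly. If you need one row per article, here is a minimal sketch using the standard library's xml.etree.ElementTree instead of read_xml. It assumes each record is a PubmedArticle element with the ID at MedlineCitation/PMID and the abstract sections at MedlineCitation/Article/Abstract/AbstractText; check that against your file. The extract_abstracts helper and the tiny SAMPLE document are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a PubMed file. The element layout is an assumption
# about the real files, not something read_xml reports for you.
SAMPLE = b"""<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">1212</PMID>
      <Article>
        <Abstract>
          <AbstractText Label="BACKGROUND">First <i>part</i>.</AbstractText>
          <AbstractText Label="METHODS">Second part.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">1233</PMID>
      <Article/>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def extract_abstracts(source):
    """Return one {'PMID', 'AbstractText'} dict per PubmedArticle.

    `source` is a file path or a binary file object (hypothetical helper).
    """
    rows = []
    # iterparse streams the file, so a large download is not loaded at once.
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "PubmedArticle":
            continue
        pmid = elem.findtext("MedlineCitation/PMID")
        # itertext() keeps text inside inline tags such as <i> or <sup>.
        parts = [
            "".join(node.itertext()).strip()
            for node in elem.iterfind("MedlineCitation/Article/Abstract/AbstractText")
        ]
        # Labelled sections are joined into one string; None if no abstract.
        rows.append({"PMID": pmid, "AbstractText": " ".join(parts) or None})
        elem.clear()  # free the finished record
    return rows


rows = extract_abstracts(io.BytesIO(SAMPLE))
print(rows)
```

Passing the result to pd.DataFrame(rows) should give the two-column PMID / AbstractText frame from the question. For the real download, pass the file path instead of the BytesIO object.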
Barry the Platipus
  • This seems to mostly work. I'm trying something similar with the `AbstractText` (`pd.read_xml('../../Downloads/pubmed22n1115.xml', xpath=".//Abstract")`) but running into an issue. Mainly, the abstract that I need is under `` and then between 2 `` . However, there's also some `AbstractText` that looks like ` – Penguin Aug 02 '22 at 20:55
  • You can filter out the 'objective' texts in dataframe... I'm not sure I understand you. – Barry the Platipus Aug 02 '22 at 21:20
  • If you somehow hit a wall with pandas read_xml and xpath (though you should be alright with it), you can always try xml.etree.ElementTree, see this article from Medium: https://florian-kromer.medium.com/parsing-xml-into-pandas-dataframes-661882abd8e5 – Barry the Platipus Aug 02 '22 at 21:30
  • Appreciate it : ) – Penguin Aug 02 '22 at 21:36