Parsing Pubmed API xml with lxml then grabbing children into dictionary

Question

I am trying to re-learn Python, so my skills are lacking. I am currently playing with the PubMed APIs. I am trying to parse the XML file that is given here, then loop over each child ('/PubmedArticle'), grab a few things (for now just the article title), and enter them into a dictionary keyed by the PubMed ID (PMID).

i.e. the output should look like:

{'29150897': {'title': 'Determining best outcomes from community-acquired pneumonia and how to achieve them.'},
 '29149862': {'title': 'Telemedicine as an effective intervention to improve antibiotic appropriateness prescription and to reduce costs in pediatrics.'}}

Later I will add more fields such as author and journal; for now I just want to figure out how to use the lxml package to get the data I want into a dictionary. I know there are plenty of packages that can do this for me, but that defeats the purpose of learning. I've tried a bunch of different things, and this is what I'm currently trying to do:

import requests
from lxml import etree
article_url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&tool=PMA&id=29150897,29149862"
page = requests.get(article_url)
tree = etree.fromstring(page.content)

dict_out = {}

for x in tree.xpath('//PubmedArticle'):
    pmid = ''.join([x.text for x in x.xpath('//MedlineCitation/PMID[@Version="1"]')])
    title = ''.join([x.text for x in x.xpath('//ArticleTitle')])

    dict_out[pmid] = {'title': title}

print(dict_out)

I probably have a misunderstanding about how to go about this process, but if anyone can offer insight or lead me in the right direction for resources, that would be greatly appreciated.

Edit: My apologies. I wrote this far quicker than I should have. I have fixed up all the cases. Also, the result it throws seems to combine the PMIDs while just giving the first title:

{'2725403628806902': {'title': 'Handshake Stewardship: A Highly Effective Rounding-based Antimicrobial Optimization Service.Monitoring, documenting and reporting the quality of antibiotic use in the Netherlands: a pilot study to establish a national antimicrobial stewardship registry.'}}

Ta

What is the problem? Does the code produce anything? Are there errors? — mzjn, Nov 22 '17 at 11:36
`medlinecitation` should be `MedlineCitation`. `pmid` should be `PMID`. And so on. Case is significant! — mzjn, Nov 22 '17 at 11:43
Thanks mzjn! I have hopefully fixed up the question for readers. — Banana Mannock, Nov 22 '17 at 12:05

CristiFati · Accepted Answer · 2018-12-30T20:20:51.393

code.py:

#!/usr/bin/env python3

import sys
import requests
from lxml import etree
from pprint import pprint as pp

ARTICLE_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&tool=PMA&id=29150897,29149862"


def main():
    response = requests.get(ARTICLE_URL)
    tree = etree.fromstring(response.content)
    ids = tree.xpath("//MedlineCitation/PMID[@Version='1']")
    titles = tree.xpath("//Article/ArticleTitle")
    if len(ids) != len(titles):
        print("ID count doesn't match Title count...")
        return
    result = {_id.text: {"title": title.text} for _id, title in zip(ids, titles)}
    pp(result)


if __name__ == "__main__":
    print("Python {:s} on {:s}\n".format(sys.version, sys.platform))
    main()

Notes:

I structured the code a little bit and renamed some variables for clarity
ids holds the list of PMID nodes, while titles holds the list of (corresponding) ArticleTitle nodes (notice the paths!)
The way to join them together in the desired format is a dict comprehension (PEP 274); to iterate over the two lists at the same time, zip(*iterables) was used

Output:

(py35x64_test) c:\Work\Dev\StackOverflow\q47433632>"c:\Work\Dev\VEnvs\py35x64_test\Scripts\python.exe" code.py
Python 3.5.4 (v3.5.4:3f56838, Aug  8 2017, 02:17:05) [MSC v.1900 64 bit (AMD64)] on win32

{'29149862': {'title': 'Telemedicine as an effective intervention to improve '
                       'antibiotic appropriateness prescription and to reduce '
                       'costs in pediatrics.'},
 '29150897': {'title': 'Determining best outcomes from community-acquired '
                       'pneumonia and how to achieve them.'}}
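To illustrate the note about combining a dict comprehension with zip, here is a minimal sketch that needs no network access. The ids and titles are made-up placeholders standing in for the .text values of the PMID and ArticleTitle nodes; they are not real PubMed data.

```python
# Placeholder values standing in for the .text of the PMID and ArticleTitle nodes.
ids = ["111", "222"]
titles = ["First title.", "Second title."]

# zip() pairs items by position; the dict comprehension turns each pair
# into one {pmid: {"title": ...}} entry.
result = {pmid: {"title": title} for pmid, title in zip(ids, titles)}
print(result)

# zip() stops at the shorter input without raising an error, so a missing
# title would silently drop (or misalign) entries. That is the reason for
# the length check in code.py.
truncated = {pmid: {"title": title} for pmid, title in zip(ids, titles[:1])}
print(truncated)
```

Pairing by position only works when both lists line up one-to-one, which is what the len(ids) != len(titles) guard in code.py is meant to catch.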

score 0 · Answer 2 · answered Nov 22 '17 at 11:45


First of all, XML is case-sensitive, and you are using lowercase tags in your XPath expressions.

Also, I believe pmid should be a number (or a string representing one), and in your case it seems to be something different.

In my tests

`pmid = ''.join([x.text for x in x.xpath('//MedlineCitation/PMID[@Version="1"]')])`

produces a string of concatenated numbers, which is not what you are looking for.
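The concatenation seems to come from the leading // in the inner expressions: in XPath, a path starting with // is evaluated from the document root, even when it is called on a child element, so every PMID in the file matches on every pass of the loop. Prefixing the path with a dot (.//) restricts the search to the current node. The sketch below shows the difference on a hand-written, heavily trimmed stand-in for the efetch response; the sample XML, the PMIDs 111 and 222, and the titles are invented for illustration.

```python
from lxml import etree

# Hypothetical, trimmed-down stand-in for an efetch response (not real API output).
SAMPLE_XML = b"""<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">111</PMID>
      <Article><ArticleTitle>First title.</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">222</PMID>
      <Article><ArticleTitle>Second title.</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

tree = etree.fromstring(SAMPLE_XML)
dict_out = {}

for article in tree.xpath("//PubmedArticle"):
    # "//" restarts at the document root, so this matches every PMID in the file.
    all_pmids = article.xpath("//MedlineCitation/PMID")
    # ".//" searches only below the current PubmedArticle node.
    own_pmids = article.xpath(".//MedlineCitation/PMID")
    assert len(all_pmids) == 2 and len(own_pmids) == 1

    pmid = own_pmids[0].text
    title = article.findtext(".//ArticleTitle")
    dict_out[pmid] = {"title": title}

print(dict_out)
```

With the relative paths, each pass of the loop sees only its own article, so the dictionary gets one entry per PubmedArticle.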


running.t


Thanks running, I have fixed the cases. You are right that's what that line is doing, just shoving them all together. I was hoping that the 'for' loop would just work through each '//PubMedArticle' node instead of grabbing them both at once. I have no idea how to get them iteratively. – Banana Mannock Nov 22 '17 at 12:07
