
I am trying to re-learn Python, so my skills are lacking. I am currently playing with the PubMed APIs. I am trying to parse the XML file that is given here, then loop through each child ('/pubmedarticle') and grab a few things (for now just the article title), and enter them into a dictionary under the key of the PubMed ID (pmid).

i.e. the output should look like:

{'29150897': {'title': 'Determining best outcomes from community-acquired pneumonia and how to achieve them.'},
 '29149862': {'title': 'Telemedicine as an effective intervention to improve antibiotic appropriateness prescription and to reduce costs in pediatrics.'}}

Later I will add more fields such as author and journal; for now I just want to figure out how to use the lxml package to get the data I want into a dictionary. I know there are plenty of packages that can do this for me, but that defeats the purpose of learning. I've tried a bunch of different things, and this is what I'm currently trying:

import requests
from lxml import etree
article_url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&tool=PMA&id=29150897,29149862"
page = requests.get(article_url)
tree = etree.fromstring(page.content)

dict_out = {}

for x in tree.xpath('//PubmedArticle'):
    pmid = ''.join([x.text for x in x.xpath('//MedlineCitation/PMID[@Version="1"]')])
    title = ''.join([x.text for x in x.xpath('//ArticleTitle')])

    dict_out[pmid] = {'title': title}

print(dict_out)

I probably have a misunderstanding about how to go about this process, but if anyone can offer insight or lead me in the right direction for resources, that would be greatly appreciated.

Edit: My apologies. I wrote this far quicker than I should have. I have fixed up all the cases. Also, the result it throws seems to combine the PMIDs while just giving the first title:

{'2725403628806902': {'title': 'Handshake Stewardship: A Highly Effective Rounding-based Antimicrobial Optimization Service.Monitoring, documenting and reporting the quality of antibiotic use in the Netherlands: a pilot study to establish a national antimicrobial stewardship registry.'}}

Ta

2 Answers


code.py:

#!/usr/bin/env python3

import sys
import requests
from lxml import etree
from pprint import pprint as pp

ARTICLE_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&tool=PMA&id=29150897,29149862"


def main():
    response = requests.get(ARTICLE_URL)
    tree = etree.fromstring(response.content)
    ids = tree.xpath("//MedlineCitation/PMID[@Version='1']")
    titles = tree.xpath("//Article/ArticleTitle")
    if len(ids) != len(titles):
        print("ID count doesn't match Title count...")
        return
    result = {_id.text: {"title": title.text} for _id, title in zip(ids, titles)}
    pp(result)


if __name__ == "__main__":
    print("Python {:s} on {:s}\n".format(sys.version, sys.platform))
    main()

Notes:

  • I structured the code a little bit and renamed some variables for clarity
  • ids holds the list of PMID nodes, while titles holds the list of (corresponding) ArticleTitle nodes (notice the paths!)
  • The way to join them together in the desired format is a dict comprehension (PEP 274); to iterate over the 2 lists at the same time, zip(*iterables) was used

Output:

(py35x64_test) c:\Work\Dev\StackOverflow\q47433632>"c:\Work\Dev\VEnvs\py35x64_test\Scripts\python.exe" code.py
Python 3.5.4 (v3.5.4:3f56838, Aug  8 2017, 02:17:05) [MSC v.1900 64 bit (AMD64)] on win32

{'29149862': {'title': 'Telemedicine as an effective intervention to improve '
                       'antibiotic appropriateness prescription and to reduce '
                       'costs in pediatrics.'},
 '29150897': {'title': 'Determining best outcomes from community-acquired '
                       'pneumonia and how to achieve them.'}}
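
The pairing step from the notes can be looked at in isolation. This is only a minimal sketch with made-up IDs and titles (they are placeholders, not values from the API); it also shows why the length check in main() matters, since zip() stops at the shorter input without raising an error.

```python
# Made-up IDs and titles, for illustration only.
ids = ["111", "222"]
titles = ["First title.", "Second title."]

# zip() pairs items by position; the comprehension builds one entry per pair.
result = {pmid: {"title": title} for pmid, title in zip(ids, titles)}
print(result)

# With unequal lengths, zip() silently drops the unmatched tail,
# so a missing title would go unnoticed without an explicit check.
short = {pmid: {"title": title} for pmid, title in zip(ids, titles[:1])}
print(len(short))
```

Because the pairing is purely positional, it relies on every article having exactly one matching PMID and one ArticleTitle, in the same order.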
CristiFati

First of all, XML is case-sensitive, and you are using lowercase tags in your XPath.

Also, I believe pmid should be a number (or a string representing a number), and in your case it seems to be something different.

In my tests,

`pmid = ''.join([x.text for x in x.xpath('//MedlineCitation/PMID[@Version="1"]')])`

produces a string of concatenated numbers, which is not what you are looking for. A path that starts with `//` is evaluated from the document root even when it is called on a child element, so it matches the PMID of every article rather than only the current one.
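
One way to avoid the concatenation is to make the inner paths relative to the current article by starting them with `./` or `.//`. The following is only a sketch under stated assumptions: SAMPLE_XML is a hypothetical, heavily trimmed stand-in for the efetch response (not real API output), and titles_by_pmid is a helper name invented for this example.

```python
from lxml import etree

# Hypothetical, trimmed stand-in for an efetch response (not real API output).
SAMPLE_XML = b"""<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">111</PMID>
      <Article><ArticleTitle>First title.</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">222</PMID>
      <Article><ArticleTitle>Second title.</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def titles_by_pmid(xml_bytes):
    """Return {pmid: {'title': ...}} with one entry per PubmedArticle node."""
    tree = etree.fromstring(xml_bytes)
    result = {}
    for article in tree.xpath("//PubmedArticle"):
        # The leading "." keeps each search inside the current article;
        # a bare "//" would scan the whole document again.
        pmid = "".join(article.xpath("./MedlineCitation/PMID/text()"))
        title = "".join(article.xpath(".//ArticleTitle//text()"))
        result[pmid] = {"title": title}
    return result


print(titles_by_pmid(SAMPLE_XML))
```

With the real response, the same function could be called on the `content` of the `requests.get(...)` result. Collecting `//text()` under ArticleTitle is a design choice here, so that a title containing inline markup would not be cut off at the first child element.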

running.t
  • Thanks running, I have fixed the cases. You are right that's what that line is doing, just shoving them all together. I was hoping that the 'for' loop would just work through each '//PubMedArticle' node instead of grabbing them both at once. I have no idea how to get them iteratively. – Banana Mannock Nov 22 '17 at 12:07