I'm trying to get bulk data from the Europe PMC Annotations API in Python

Question

My code is:

import json

import requests

# json_to_dataframe is defined elsewhere (not shown)
if __name__ == '__main__':
    json_data = requests.get("https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds?articleIds=PMC%3A4771370&section=Abstract&provider=Europe%20PMC&format=JSON").content
    r = json.loads(json_data)
    df = json_to_dataframe(r)
    print(df)

My only problem is how to run this for multiple IDs: I have at least thousands of IDs in a file. Please help, I'm using Python.

Siddhartha · Accepted Answer · 2022-02-05T10:49:56.783

0

Assuming you know Python and can get all the IDs from the file into a list article_ids, you can use the following script:

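import json

import requests

URL = 'https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds'

# Replace with the full list of IDs read from your file
article_ids = ['PMC:4771370']

for article_id in article_ids:
    # Passing a dict via params lets requests URL-encode the values
    params = {
        'articleIds': article_id,
        'section': 'Abstract',
        'provider': 'Europe PMC',
        'format': 'JSON'
    }
    json_data = requests.get(URL, params=params).content
    r = json.loads(json_data)
    # json_to_dataframe is the helper from the question (not shown)
    df = json_to_dataframe(r)
    print(df)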
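
For the step of getting the IDs from the file into article_ids, here is a minimal sketch. It assumes a plain-text file with one ID per line in SOURCE:EXTERNAL_ID form; the function name read_article_ids and the file name in the usage note are hypothetical. Note that readlines() keeps the trailing newline on every line, so IDs read that way need stripping before they are sent to the API.

```python
def read_article_ids(path):
    """Return the non-empty, whitespace-stripped lines of a text file.

    Assumes one article ID per line, e.g. 'PMC:4771370'.
    """
    ids = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            article_id = line.strip()  # drop the trailing newline and spaces
            if article_id:  # skip blank lines
                ids.append(article_id)
    return ids
```

Usage would look like article_ids = read_article_ids('article_ids.txt'), after which the loop above runs unchanged.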

edited Feb 05 '22 at 10:49

answered Feb 05 '22 at 10:31


Hi Siddhartha, I have updated the info in the question, can you please look again? – Arvind Feb 05 '22 at 10:44
Thanks Siddhartha, can you also tell me how to open my file? I'm using readlines, which is not giving the results. – Arvind Feb 05 '22 at 11:04
@Arvind, perhaps a [Python tutorial](https://www.freecodecamp.org/news/python-open-file-how-to-read-a-text-file-line-by-line/) can help. – Siddhartha Feb 05 '22 at 11:12
@Siddhartha I already tried that, maybe I'm implementing it wrong, can you help? – Arvind Feb 05 '22 at 11:21

score 0 · Answer 2 · answered Feb 05 '22 at 11:09

After analyzing the shared URL and reading an article on URL encoding, I observed that each value of the articleIds parameter of annotationsByArticleIds has the format SOURCE:EXTERNAL_ID.

TEST1: If you hit the url:

https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds?articleIds=PMC

Output is: It must contain values with format SOURCE:EXTERNAL_ID where SOURCE must have one of the following values [PMC, MED, PAT, AGR, CBA, HIR, CTX, ETH, CIT, PPR, NBK] and EXTERNAL_ID must be a number when SOURCE=PMC

The output above shows the list of possible sources
Each source is separated from its EXTERNAL_ID by a colon
A colon is represented by %3A in URL encoding
To separate one SOURCE:EXTERNAL_ID pair from the next, you can use a comma
A comma is represented by %2C in URL encoding
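
The encoding described above can be checked with Python's standard library. This is a small sketch, not part of the API itself: the second ID (MED:12345678) is a made-up example used only to show the comma.

```python
from urllib.parse import quote, urlencode

# 'MED:12345678' is a hypothetical ID, used only for illustration
ids = ["PMC:4771370", "MED:12345678"]
joined = ",".join(ids)

# Percent-encode every reserved character, including ':' and ','
encoded = quote(joined, safe="")
print(encoded)  # PMC%3A4771370%2CMED%3A12345678

# urlencode builds a whole query string and applies the same encoding
query = urlencode({"articleIds": joined, "format": "JSON"})
print(query)  # articleIds=PMC%3A4771370%2CMED%3A12345678&format=JSON
```

When the parameters are passed to requests.get through params=, requests does this encoding itself, so the comma-joined string can be given unencoded.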

ANSWER: To fetch multiple articles, you could build a string of article IDs in the format SOURCE1:EXTERNAL_ID1,SOURCE2:EXTERNAL_ID2,...,SOURCEn:EXTERNAL_IDn and pass it as the articleIds value in the main URL.

A few limitations:

The maximum URL length may be limited to about 2048 characters
Depending on the length of the IDs, that allows roughly 150 to 200 articles per request
You could loop over the IDs in batches of 150 and fetch the required information for each batch
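
The batching idea can be sketched as follows. This is an illustration under stated assumptions, not a tested client for the API: the helper names (chunked, build_url, fetch_all) are hypothetical, the default batch_size of 8 is a deliberately small placeholder, and the pause between requests is a guess. The estimate of 150 to 200 above comes from URL length alone; the API may enforce its own, lower cap on IDs per request, so check its documentation before raising batch_size. The sketch uses only the standard library; requests.get(URL, params=...) would work the same way.

```python
import json
import time
from urllib.parse import quote, urlencode
from urllib.request import urlopen

URL = "https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds"


def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_url(batch):
    """Return the request URL for one batch of SOURCE:EXTERNAL_ID strings."""
    params = {
        "articleIds": ",".join(batch),  # comma-separated, encoded below
        "section": "Abstract",
        "provider": "Europe PMC",
        "format": "JSON",
    }
    # quote_via=quote encodes spaces as %20, matching the URL in the question
    return URL + "?" + urlencode(params, quote_via=quote)


def fetch_all(article_ids, batch_size=8, pause=0.2):
    """Fetch annotations batch by batch and return the decoded JSON responses.

    batch_size and pause are assumptions, not documented limits.
    """
    results = []
    for batch in chunked(article_ids, batch_size):
        with urlopen(build_url(batch), timeout=30) as response:
            results.append(json.load(response))
        time.sleep(pause)  # avoid sending requests back to back
    return results
```

Each element of the returned list is the decoded JSON for one batch, which can then be passed to a converter such as the json_to_dataframe helper from the question.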

Will that be a good idea for 100k IDs? I'm new to programming, so it would be good if you could provide some code. – Arvind Feb 05 '22 at 11:13
