
I have a pmc.txt file which contains at least 20k PMC IDs, and the API will only take, I think, 1000 requests each time. I have written the code for one ID, but I'm not able to do it for the whole file. Below is my main code. Please help.

import json
import requests

# json_to_dataframe is defined elsewhere in my script

if __name__ == '__main__':
    URL = 'https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds'

    article_ids = ['PMC:4771370']

    for article_id in article_ids:
        params = {
            'articleIds': article_id,
            'section': 'Abstract',
            'provider': 'Europe PMC',
            'format': 'JSON'
        }
        json_data = requests.get(URL, params=params).content
        r = json.loads(json_data)
        df = json_to_dataframe(r)
        print(df)
        df.to_csv("data.csv")
Arvind

2 Answers


You can read the IDs in from the file like so:

with open('pmc.txt', 'r') as file:
    article_ids = [item.replace('\n', '') for item in file]

You can do this instead of article_ids = ['PMC:4771370'].

Though you are going to have to save each result under a different file name (you will end up with thousands of files that way), or instead append the JSON data from every request to a single DataFrame before you write it out as a CSV (a sketch of that single-CSV variant follows the code below).

This is what you could do to split the IDs into chunks and pass each chunk in a single articleIds parameter:

import json
import requests

if __name__ == '__main__':
    URL = 'https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds'

    with open('pmc.txt', 'r') as file:
        article_ids = [item.replace('\n', '') for item in file]

    # The annotations API only accepts 1-8 article IDs per request,
    # so split the full list into chunks of 8
    chunks = [article_ids[x:x + 8] for x in range(0, len(article_ids), 8)]

    for count, chunk in enumerate(chunks):
        params = {
            'articleIds': ','.join(chunk),  # e.g. 'PMC:4771370,PMC:4771371,...'
            'section': 'Abstract',
            'provider': 'Europe PMC',
            'format': 'JSON'
        }
        json_data = requests.get(URL, params=params).content
        r = json.loads(json_data)
        df = json_to_dataframe(r)
        print(df)
        df.to_csv(f"data{count}.csv")  # one CSV per chunk
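
If you would rather end up with a single CSV instead of one file per chunk, a small variation on the loop above (still assuming your json_to_dataframe helper) is to collect each chunk's DataFrame and concatenate them at the end:

import pandas as pd

frames = []
for chunk in chunks:
    params = {
        'articleIds': ','.join(chunk),
        'section': 'Abstract',
        'provider': 'Europe PMC',
        'format': 'JSON'
    }
    r = json.loads(requests.get(URL, params=params).content)
    frames.append(json_to_dataframe(r))

# One DataFrame, one output file
pd.concat(frames, ignore_index=True).to_csv("data.csv", index=False)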
        
Andrew Ryan
  • Or you could divide the 20k IDs into 20 files of 1,000 IDs each and then load 1,000 IDs at a time into the list article_ids? – Arvind Feb 06 '22 at 07:11
  • @Arvind Unsure what you mean. Your `df.to_csv("data.csv")` uses a fixed file name, so if you run it as it currently is you will only ever have one file that keeps getting overwritten with the last data retrieved. – Andrew Ryan Feb 06 '22 at 07:19

You can use grequests. You can try setting stream=False in grequests.get, or explicitly calling response.close() after reading response.content. It's discussed in detail here.
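
As a rough sketch of that approach (reusing the URL, chunks, and json_to_dataframe from the first answer; not tested against your data), the chunked requests could be sent concurrently like this:

import grequests

# Build one not-yet-sent request per chunk of IDs
reqs = [
    grequests.get(URL, params={
        'articleIds': ','.join(chunk),
        'section': 'Abstract',
        'provider': 'Europe PMC',
        'format': 'JSON'
    }, stream=False)  # stream=False releases the connection once the body is read
    for chunk in chunks
]

# Send up to 10 requests at a time; responses come back in the same order as reqs
for count, response in enumerate(grequests.map(reqs, size=10)):
    if response is None:
        continue  # grequests returns None for requests that failed
    df = json_to_dataframe(response.json())
    df.to_csv(f"data{count}.csv")
    response.close()  # make sure the underlying connection is freed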

Additionally, you can also try requests-futures. grequests is faster but brings gevent monkey patching and extra dependency problems. requests-futures is several times slower than grequests, but simply wrapping requests in a ThreadPoolExecutor can be as fast as grequests without any external dependencies. Reference here.
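
For the ThreadPoolExecutor route, a minimal sketch (again borrowing URL, chunks, and json_to_dataframe from the first answer) needs nothing beyond requests and the standard library:

from concurrent.futures import ThreadPoolExecutor
import requests

def fetch(chunk):
    # One GET per chunk of up to 8 IDs
    params = {
        'articleIds': ','.join(chunk),
        'section': 'Abstract',
        'provider': 'Europe PMC',
        'format': 'JSON'
    }
    return requests.get(URL, params=params).json()

with ThreadPoolExecutor(max_workers=10) as pool:
    # pool.map preserves the order of the chunks
    for count, result in enumerate(pool.map(fetch, chunks)):
        json_to_dataframe(result).to_csv(f"data{count}.csv")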

Ragesh Hajela