
I have a pmc.txt file which contains at least 20k PMC IDs, and the API will only take, I think, 1000 requests each time. I have written the code for one ID, but I'm not able to do it for the whole file. Below is my main code. Please help.

import json
import requests

# json_to_dataframe is defined elsewhere in my script

if __name__ == '__main__':
    URL = 'https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds'

    article_ids = ['PMC:4771370']

    for article_id in article_ids:
        params = {
            'articleIds': article_id,
            'section': 'Abstract',
            'provider': 'Europe PMC',
            'format': 'JSON'
        }
        json_data = requests.get(URL, params=params).content
        r = json.loads(json_data)
        df = json_to_dataframe(r)
        print(df)
        df.to_csv("data.csv")
Arvind

2 Answers


You can read the IDs in from the file like so:

with open('pmc.txt', 'r') as file:
    article_ids = [item.replace('\n', '') for item in file]

You can do this instead of article_ids = ['PMC:4771370'].

Though you are going to have to save each result under a different file name (you will end up with thousands of files that way), or instead append the JSON data from every request to a single DataFrame before you write it out as a CSV (a sketch of that single-CSV variant follows the code below).

This is what you could do to split the IDs into chunks and pass each chunk in a single articleIds parameter:

import json
import requests

if __name__ == '__main__':
    URL = 'https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds'

    with open('pmc.txt', 'r') as file:
        article_ids = [item.replace('\n', '') for item in file]

    # The annotations API only accepts 1-8 article IDs per request,
    # so split the full list into chunks of 8
    chunks = [article_ids[x:x + 8] for x in range(0, len(article_ids), 8)]

    for count, chunk in enumerate(chunks):
        params = {
            'articleIds': ','.join(chunk),  # e.g. 'PMC:4771370,PMC:4771371,...'
            'section': 'Abstract',
            'provider': 'Europe PMC',
            'format': 'JSON'
        }
        json_data = requests.get(URL, params=params).content
        r = json.loads(json_data)
        df = json_to_dataframe(r)
        print(df)
        df.to_csv(f"data{count}.csv")  # one CSV per chunk
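
If you would rather end up with a single CSV instead of one file per chunk, a small variation on the loop above (still assuming your json_to_dataframe helper) is to collect each chunk's DataFrame and concatenate them at the end:

import pandas as pd

frames = []
for chunk in chunks:
    params = {
        'articleIds': ','.join(chunk),
        'section': 'Abstract',
        'provider': 'Europe PMC',
        'format': 'JSON'
    }
    r = json.loads(requests.get(URL, params=params).content)
    frames.append(json_to_dataframe(r))

# One DataFrame, one output file
pd.concat(frames, ignore_index=True).to_csv("data.csv", index=False)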
        
Andrew Ryan
  • Or you could divide the 20k IDs into 20 files of 1,000 IDs each and then load 1,000 IDs at a time into the list article_ids? – Arvind Feb 06 '22 at 07:11
  • @Arvind Unsure what you mean. Your `df.to_csv("data.csv")` uses a fixed file name, so if you run it as it currently is you will only ever have one file that keeps getting overwritten with the last data retrieved. – Andrew Ryan Feb 06 '22 at 07:19

You can use grequests. You can try setting stream=False in grequests.get, or explicitly calling response.close() after reading response.content. It's discussed in detail here.
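
As a rough sketch of that approach (reusing the URL, chunks, and json_to_dataframe from the first answer; not tested against your data), the chunked requests could be sent concurrently like this:

import grequests

# Build one not-yet-sent request per chunk of IDs
reqs = [
    grequests.get(URL, params={
        'articleIds': ','.join(chunk),
        'section': 'Abstract',
        'provider': 'Europe PMC',
        'format': 'JSON'
    }, stream=False)  # stream=False releases the connection once the body is read
    for chunk in chunks
]

# Send up to 10 requests at a time; responses come back in the same order as reqs
for count, response in enumerate(grequests.map(reqs, size=10)):
    if response is None:
        continue  # grequests returns None for requests that failed
    df = json_to_dataframe(response.json())
    df.to_csv(f"data{count}.csv")
    response.close()  # make sure the underlying connection is freed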

Additionally, you can also try requests-futures. grequests is faster but brings gevent monkey patching and extra dependency problems. requests-futures is several times slower than grequests, but simply wrapping requests in a ThreadPoolExecutor can be as fast as grequests without any external dependencies. Reference here.
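
For the ThreadPoolExecutor route, a minimal sketch (again borrowing URL, chunks, and json_to_dataframe from the first answer) needs nothing beyond requests and the standard library:

from concurrent.futures import ThreadPoolExecutor
import requests

def fetch(chunk):
    # One GET per chunk of up to 8 IDs
    params = {
        'articleIds': ','.join(chunk),
        'section': 'Abstract',
        'provider': 'Europe PMC',
        'format': 'JSON'
    }
    return requests.get(URL, params=params).json()

with ThreadPoolExecutor(max_workers=10) as pool:
    # pool.map preserves the order of the chunks
    for count, result in enumerate(pool.map(fetch, chunks)):
        json_to_dataframe(result).to_csv(f"data{count}.csv")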

Ragesh Hajela