I would like to access the ScopusSearch API and obtain the EIDs for a list of 1400 article titles that are saved in an Excel spreadsheet. I tried to retrieve the EIDs with the following code:

import numpy as np
import pandas as pd
from pybliometrics.scopus import ScopusSearch
nan = pd.read_excel(r'C:\Users\Apples\Desktop\test\titles_nan.xlsx', sheet_name='nan')
error_index = {}

for i in range(0,len(nan)):
   scopus_title = nan.loc[i ,'Title']
   s = ScopusSearch('TITLE("{0}")'.format(scopus_title))
   print('TITLE("{0}")'.format(scopus_title))
   try:
      s = ScopusSearch(scopus_title)
      nan.at[i,'EID'] = s.results[0].eid
      print(str(i) + ' ' + s.results[0].eid)
   except:
      nan.loc[i,'EID'] = np.nan
      error_index[i] = scopus_title
      print(str(i) + 'error' )

However, I was never able to retrieve the EIDs beyond roughly 100 titles, because certain titles yield far too many search results and that stalls the entire process.

As such, I want to skip titles that return too many results and move on to the next title, all while keeping a record of the titles that were skipped.

I am just starting out with Python so I am not sure how to go about doing this. I have the following sequence in mind:

• If the title yields exactly 1 result, retrieve the EID and record it under the ‘EID’ column of file ‘nan’.

• If the title yields more than 1 result, record the title in the error index, print ‘Too many searches’ and move on to the next title.

• If the title does not yield any results, record the title in the error index, print ‘Error’ and move on to the next title.

Attempt 1
for i in range(0,len(nan)):
    scopus_title = nan.at[i ,'Title']
    print('TITLE("{0}")'.format(scopus_title))
    s = ScopusSearch('TITLE("{0}")'.format(scopus_title))
    print(type(s))

    if(s.count()== 1):
        nan.at[i,"EID"] = s.results[0].eid
        print(str(i) + "   " + s.results[0].eid)
    elif(s.count()>1):
        continue
        print(str(i) + "  " + "Too many searches")
    else:
        error_index[i] = scopus_title
        print(str(i) + "error")

Attempt 2
for i in range(0,len(nan)):
    scopus_title = nan.at[i ,'Title']
    print('TITLE("{0}")'.format(scopus_title))
    s = ScopusSearch('TITLE("{0}")'.format(scopus_title))
    if len(s.results)== 1:
        nan.at[i,"EID"] = s.results[0].eid
        print(str(i) + "   " + s.results[0].eid)
    elif len(s.results)>1:  
        continue
        print(str(i) + "  " + "Too many searches")
    else:
        continue
        print(str(i) + "  " + "Error")

I got errors stating that an object of type 'ScopusSearch' has no len() or count(), or that the results are not a list themselves. I am unable to proceed from here. In addition, I am not sure whether skipping titles based on too many results is the right way to go about it. Are there more effective methods (e.g. timeouts, so that a title is skipped after a certain amount of time is spent on the search)?

Any help on this matter would be very much appreciated. Thank you!

Apples
  • The library docs mention [`get_results_size()`](https://pybliometrics.readthedocs.io/en/stable/classes/ScopusSearch.html#pybliometrics.scopus.ScopusSearch.get_results_size), but I assume that will only be available after results are fetched, which defeats your purpose? – shriakhilc Dec 28 '21 at 14:20
  • Not when you combine it with `download=False`, @shriakhilc ;) – MERose Dec 28 '21 at 14:58

1 Answer

Combine .get_results_size() with download=False:

from pybliometrics.scopus import ScopusSearch

scopus_title = "Editorial"
q = f'TITLE("{scopus_title}")'  # this is f-string notation, btw
s = ScopusSearch(q, download=False)
s.get_results_size()
# 243142

If this number is below a certain threshold, simply run s = ScopusSearch(q) to download the results and proceed as in "Attempt 2":

for i, row in nan.iterrows():
    # Inner quotes must differ from the outer ones (before Python 3.12)
    q = f'TITLE("{row["Title"]}")'
    print(q)
    s = ScopusSearch(q, download=False)  # only fetch the number of results
    n = s.get_results_size()
    if n == 1:
        s = ScopusSearch(q)  # now download the single result
        nan.at[i, "EID"] = s.results[0].eid
        print(f"{i} {s.results[0].eid}")
    elif n > 1:
        print(f"{i} Too many results")
        continue  # must come last
    else:
        print(f"{i} Error")
        continue  # must come last

(I used .iterrows() here to get rid of the explicit indexing. Note that i is then the index label, so it will not be a running count if the index is not a range sequence. In that case, wrap the iterator in enumerate().)
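The question also asks for a record of the skipped titles, which the loop above does not keep. A minimal sketch of that bookkeeping follows. The helper names `classify` and `triage_titles` and the `fake_counts` stand-in are hypothetical, chosen for illustration only; the counting step is passed in as a callable so the sketch runs without a Scopus API key.

```python
def classify(n_results):
    """Map a result count onto the three cases from the question."""
    if n_results == 1:
        return "ok"
    if n_results > 1:
        return "too many results"
    return "no results"


def triage_titles(titles, count_results):
    """Split titles into unique matches and skipped ones.

    count_results is any callable that returns the number of hits for a
    title (an assumption of this sketch, not part of pybliometrics).
    """
    unique = []    # positions of titles with exactly one hit
    skipped = {}   # position -> (title, reason), like error_index
    for pos, title in enumerate(titles):
        reason = classify(count_results(title))
        if reason == "ok":
            unique.append(pos)
        else:
            skipped[pos] = (title, reason)
    return unique, skipped


# Stand-in counts (hypothetical data) instead of real Scopus queries.
fake_counts = {"A unique title": 1, "Editorial": 243142, "Nonexistent": 0}
unique, skipped = triage_titles(list(fake_counts), fake_counts.get)
print(unique)
print(skipped)
```

With pybliometrics, `count_results` could presumably be something like `lambda t: ScopusSearch(f'TITLE("{t}")', download=False).get_results_size()`, so that the full download only happens for the titles listed in `unique`.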

MERose