I'm currently using Python's Sickle module to iterate through an OAI repository of roughly 4 million records. I've looked through the Sickle documentation for an obvious way to split the records returned by sickle.ListRecords into chunks that would suit a data-parallel task. Put more plainly, this is what I would like to do:

from sickle import Sickle

sickle = Sickle('https://url/to/oai/repository')
recs = sickle.ListRecords(metadataPrefix='oai_dc')

# Split recs into 12 sections, named rec_1 ... rec_12, on a 16-core machine.

# Core i then runs the following on its own section, rec_i:
abstracts = []
for record in rec_i:
    abstracts.append(record['abstract'])

I suspect there is no way to do this natively in Sickle, but being able to split the result of sickle.ListRecords into separate sections would be very helpful. Failing that, if anyone could recommend an analogous approach in Python that allows for parallelism, I'd be very appreciative.
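
For concreteness, here is a rough sketch of the kind of split being asked about. It is not a confirmed Sickle feature. It assumes the repository supports OAI-PMH selective harvesting by datestamp (the `from` and `until` arguments), so that each worker process can issue its own ListRecords request over a separate date window. The helper names (`split_date_range`, `harvest_window`, `harvest_parallel`) are hypothetical, and reading `description` from `record.metadata` is an assumption, since `oai_dc` has no `abstract` element.

```python
from datetime import date, timedelta
from multiprocessing import Pool


def split_date_range(start, end, parts):
    """Split the inclusive range [start, end] into at most `parts`
    contiguous, non-overlapping (from, until) ISO date pairs."""
    total_days = (end - start).days + 1
    parts = min(parts, total_days)
    base, extra = divmod(total_days, parts)
    windows = []
    cursor = start
    for i in range(parts):
        # The first `extra` windows get one additional day each.
        length = base + (1 if i < extra else 0)
        last = cursor + timedelta(days=length - 1)
        windows.append((cursor.isoformat(), last.isoformat()))
        cursor = last + timedelta(days=1)
    return windows


def harvest_window(args):
    """Worker: harvest one date window using its own Sickle client."""
    # Imported here so that every process builds its own client.
    from sickle import Sickle
    from sickle.oaiexceptions import NoRecordsMatch

    url, (from_date, until_date) = args
    client = Sickle(url)
    abstracts = []
    try:
        # 'from' is a Python keyword, so the arguments go in as a dict.
        records = client.ListRecords(**{
            'metadataPrefix': 'oai_dc',
            'from': from_date,
            'until': until_date,
            'ignore_deleted': True,
        })
        for record in records:
            # Assumption: abstracts live in dc:description.
            abstracts.extend(record.metadata.get('description', []))
    except NoRecordsMatch:
        # An empty window is not an error for this purpose.
        pass
    return abstracts


def harvest_parallel(url, start, end, workers=12):
    """Harvest [start, end] using one process per date window."""
    windows = split_date_range(start, end, workers)
    with Pool(processes=len(windows)) as pool:
        chunks = pool.map(harvest_window, [(url, w) for w in windows])
    # Flatten the per-window lists into one list of abstracts.
    return [abstract for chunk in chunks for abstract in chunk]
```

It would be called as `harvest_parallel('https://url/to/oai/repository', date(2000, 1, 1), date(2020, 12, 31))`, from under an `if __name__ == '__main__':` guard so that multiprocessing works on platforms that spawn new processes. The date bounds here are placeholders. Records are rarely spread evenly across datestamps, so equal-length windows will not give equal workloads, and since harvesting is mostly network-bound, the server may throttle concurrent requests.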

mgrogger