Problem:
I have 2M+ user records in my Datastore project. I would like to send a weekly newsletter to all users. The mailing API accepts at most 50 email addresses per API call.
Previous Solution:
I used an App Engine backend and a simple Datastore query to process all the records in one go. What happened is that I would sometimes get a memory-overflow critical error in the logs and the process would start all over again, so some users got the same email more than once. That is why I moved to Dataflow.
Current Solution:
I use FlatMap to pass each email ID to a function that sends the mail to each user individually:
import apache_beam as beam
from apache_beam.io.gcp.datastore.v1.datastoreio import ReadFromDatastore

def process_datastore(project, pipeline_options):
    p = beam.Pipeline(options=pipeline_options)
    query = make_query()  # builds the Datastore query over the user kind
    entities = p | 'read from datastore' >> ReadFromDatastore(project, query)
    # sendMail must return an iterable; FlatMap flattens whatever it returns.
    entities | 'send mail' >> beam.FlatMap(
        lambda entity: sendMail([entity.properties.get('emailID', "")]))
    return p.run()
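For reference, the pipeline is launched on the Dataflow runner roughly like this (the project ID, bucket, and region below are placeholders, not my real setup):

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                # placeholder
    temp_location='gs://my-bucket/tmp',  # placeholder
    region='us-central1')
process_datastore('my-project', options)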
With Cloud Dataflow, each user gets the mail exactly once, nobody is missed out, and there are no memory errors.
But this current process takes 7 hours to finish. I tried replacing FlatMap with ParDo, assuming that ParDo would parallelize the process, but even that takes the same time.
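As I understand it, beam.FlatMap(fn) is itself just a convenience wrapper around a ParDo, so swapping one for the other should not change parallelism at all. A minimal sketch of the equivalent ParDo I tried, reusing sendMail and entities from above:

class SendMailFn(beam.DoFn):
    # One mail API call per element, same as the FlatMap above; the
    # per-element API latency dominates either way.
    def process(self, entity):
        sendMail([entity.properties.get('emailID', "")])

entities | 'send mail via pardo' >> beam.ParDo(SendMailFn())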
Question:
How can I bunch the email IDs into groups of 50, so that each mail API call is used effectively (see my rough sketch below)?
How can I parallelize the process so that the total time taken is less than an hour?
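What I am considering, as a rough sketch rather than a tested solution, and assuming sendMail accepts a list of up to 50 addresses (the API limit suggests it does): extract the address from each entity, group the addresses with Beam's built-in BatchElements transform, and make one API call per batch.

emails = (p
          | 'read from datastore' >> ReadFromDatastore(project, query)
          | 'extract email' >> beam.Map(
              lambda entity: entity.properties.get('emailID', ""))
          # min == max pins the batch size at 50 (a trailing batch may be smaller)
          | 'batch into 50s' >> beam.BatchElements(min_batch_size=50,
                                                   max_batch_size=50)
          | 'send batch' >> beam.Map(sendMail))

That would cut 2M+ API calls down to roughly 40k. If the batched sends are still too slow, is raising worker-level parallelism with the DataflowRunner options --max_num_workers and --autoscaling_algorithm=THROUGHPUT_BASED the right way to get under an hour?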