
My use case is to write to DynamoDB from a Spark application. As I have limited write capacity for DynamoDB and do not want to increase it because of the cost implications, how can I limit the Spark application to write at a regulated speed?

Can this be achieved by reducing the partitions to 1 and then executing foreachPartition()?

I already have auto-scaling enabled but don't want to increase it any further.

Please suggest other ways of handling this.

EDIT: This needs to be achieved when the Spark application is running on a multi-node EMR cluster.


1 Answer


Bucket scheduler

The way I would do this is to create a token bucket scheduler in your Spark application. The token bucket pattern is a common design for ensuring an application does not breach API limits. I have used this design successfully in very similar situations. You may find someone has already written a library you can use for this purpose.
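
As a rough sketch only (not a drop-in solution), here is what a per-partition token bucket could look like in PySpark with boto3. The `TokenBucket` class, the table name `my_table`, the DataFrame `df`, and the ~10 writes/second budget are all assumptions for illustration:

```python
import time

class TokenBucket:
    """Allows roughly `rate` operations per second, with bursts up to `capacity`."""
    def __init__(self, rate, capacity):
        self.rate = rate                      # tokens refilled per second
        self.capacity = capacity              # maximum burst size
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def acquire(self, tokens=1):
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now
            if self.tokens >= tokens:
                self.tokens -= tokens
                return
            # Sleep just long enough for the missing tokens to accumulate.
            time.sleep((tokens - self.tokens) / self.rate)

def write_partition(rows):
    import boto3                                            # imported on the executor
    table = boto3.resource("dynamodb").Table("my_table")    # assumed table name
    bucket = TokenBucket(rate=10, capacity=10)              # ~10 writes/sec; match your WCU
    for row in rows:
        bucket.acquire()
        table.put_item(Item=row.asDict())

# Single partition, as suggested in the question, so every write passes through one bucket.
df.coalesce(1).foreachPartition(write_partition)
```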

DynamoDB retry

Another (less attractive) option would be to increase the number of retries on your DynamoDB connection. When a write fails because provisioned throughput is exceeded, you can essentially instruct your DynamoDB SDK to keep retrying for as long as you like. Details in this answer. This option may appeal if you want a 'quick and dirty' solution.
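
A minimal sketch of that retry approach, assuming boto3 (the table name, the 25-attempt limit, and the retry `mode` option, which needs a reasonably recent botocore, are assumptions):

```python
import boto3
from botocore.config import Config

# Keep retrying throttled writes with backoff instead of failing fast.
dynamodb = boto3.resource(
    "dynamodb",
    config=Config(retries={"max_attempts": 25, "mode": "standard"}),
)
table = dynamodb.Table("my_table")        # assumed table name
table.put_item(Item={"id": "example"})    # retried when provisioned throughput is exceeded
```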

  • That is a good approach, but in the case of a cluster with multiple nodes it won't be feasible. Any solutions considering this restriction? – Abhay Dubey May 18 '18 at 12:30
  • Token bucket can be extended to work reasonably well in a distributed system as long as you know how many nodes participate in the system. If you rely on DynamoDB to essentially handle the throttling for you, then that just works out of the box. Make your client retry "indefinitely" and just let it rip – Mike Dinescu May 19 '18 at 16:13
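
To illustrate the distributed extension mentioned in that comment, a hedged sketch reusing the `TokenBucket` class from the earlier sketch: split an assumed overall budget evenly across a known number of writing partitions (`TOTAL_WRITES_PER_SEC`, `NUM_PARTITIONS`, and the table name are placeholders, not values from the discussion):

```python
# Placeholders: set the overall budget from your table's provisioned WCU and
# pick a partition count your cluster can sustain.
TOTAL_WRITES_PER_SEC = 100
NUM_PARTITIONS = 10
PER_PARTITION_RATE = TOTAL_WRITES_PER_SEC / NUM_PARTITIONS

def write_partition(rows):
    import boto3
    table = boto3.resource("dynamodb").Table("my_table")    # assumed table name
    # Each partition gets an equal share of the cluster-wide budget.
    bucket = TokenBucket(rate=PER_PARTITION_RATE, capacity=PER_PARTITION_RATE)
    for row in rows:
        bucket.acquire()
        table.put_item(Item=row.asDict())

df.repartition(NUM_PARTITIONS).foreachPartition(write_partition)
```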