
I am trying to load several GB of data, stored locally in 6 txt files, into tables in a dockerized local DynamoDB instance, using Python 3 and the boto3 library.

The problem is the speed of the process: the estimated time for loading a single file's data (~10M lines) is 19 hours. I used a profiler to find the bottleneck, and the majority of the computational time is taken by the boto3 call that stores the items in the database.

    def add_batch(self, items, table):
        # Write one batch (at most 25 items, the batch_write_item limit) to a table.
        if not isinstance(items, list):
            print(f"\nError while loading batch of items: expected a <list> but got {type(items)}")
            return None
        request = {
            table: [{'PutRequest': {'Item': item}} for item in items]
        }
        # batch_write_item can return unprocessed items; resubmit them until
        # the whole batch has been written.
        while request:
            response = self._client.batch_write_item(RequestItems=request)  # by far the slowest call
            if response['UnprocessedItems']:
                request = response['UnprocessedItems']
                print('unprocessed items: ', request)
            else:
                request = None
        return 0

The batch size is 25 items (the maximum batch_write_item accepts), and the provisioned throughput for the table is 100 (I tried a lot of values with little effect).
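As an aside, boto3's higher-level resource API provides batch_writer, which does the 25-item chunking and the resubmission of unprocessed items automatically. A minimal sketch, assuming a hypothetical table name my_table and items as plain Python dicts (the resource API, unlike the low-level client, does not use the {'S': ...} attribute-value format):

    import boto3

    # Resource-level handle pointed at the local container; the endpoint URL is
    # illustrative, and region/credentials are assumed to be configured elsewhere.
    dynamodb = boto3.resource('dynamodb', endpoint_url='http://localhost:8000')
    table = dynamodb.Table('my_table')  # hypothetical table name

    items = [{'id': '1', 'payload': 'example'}]  # illustrative items

    # batch_writer buffers the puts, flushes them 25 at a time via
    # batch_write_item, and resubmits unprocessed items automatically.
    with table.batch_writer() as writer:
        for item in items:
            writer.put_item(Item=item)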

Understandably, I was getting better results when the container was run with the inMemory option set to true. I had to change that because I can't wait hours for the data to reload every time I restart the container. At the moment I start the container with this simple command: docker run -p 8000:8000 amazon/dynamodb-local -jar DynamoDBLocal.jar. I also tried some parallelization, but the boto3 library doesn't seem to like it: it keeps raising exceptions.
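On the parallelization point: boto3 Sessions are not thread-safe, so sharing one session or client across threads is a common source of such exceptions. One pattern that avoids it is to give each worker thread its own session and client. A minimal sketch under that assumption (the endpoint URL and worker count are illustrative, and batches are lists of at most 25 items in the same format add_batch uses):

    import threading
    from concurrent.futures import ThreadPoolExecutor

    import boto3

    _local = threading.local()

    def _get_client():
        # boto3 Sessions are not thread-safe: create one client per thread.
        if not hasattr(_local, 'client'):
            session = boto3.session.Session()
            _local.client = session.client(
                'dynamodb', endpoint_url='http://localhost:8000')
        return _local.client

    def write_batch(batch, table):
        # batch: a list of at most 25 items in the low-level attribute-value format
        request = {table: [{'PutRequest': {'Item': item}} for item in batch]}
        client = _get_client()
        while request:
            response = client.batch_write_item(RequestItems=request)
            request = response['UnprocessedItems'] or None

    def load_parallel(batches, table, workers=8):
        # batches: an iterable of <=25-item lists
        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(lambda b: write_batch(b, table), batches))

Whether this actually speeds things up against DynamoDB Local is another question; see the answer below.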

[Screenshot: resource utilization while executing the loading function]

  • Why are you loading gigs of data into DynamoDB Local? That’s not what it was designed for. – hunterhacker Aug 31 '22 at 23:15
  • @hunterhacker A university project requires us to use DynamoDB and I'd rather not pay for the web service. – tex Sep 01 '22 at 04:49
  • DynamoDB has a quite generous free tier of 25 reads and 25 writes per second, 25 GB of storage, and more. Just turn off auto-scaling, which is on by default. https://dynobase.dev/dynamodb-free-tier/ I know customers that run their entire business inside the free tier. If you have any other DynamoDB questions, find me on Twitter @NoSQLKnowHow. – NoSQLKnowHow Sep 01 '22 at 16:23
  • @NoSQLKnowHow thank you, it is a good idea but since the initial dataset has the equivalent of 100M items, the throttling I'd have to apply while loading it to stay in the free tier would make it not worthwhile – tex Sep 03 '22 at 07:32

1 Answer


DynamoDB Local is not designed for performance at all. It is merely meant for offline functional development and testing before deploying to the actual DynamoDB service in production.

NoSQLKnowHow