
I have a massive dump of data that I need uploaded through an API. The requests need to be done one record at a time for data validation reasons.

The API can support up to 1000 records in one POST request, but the validation error is too vague to identify which record has an issue.

I'll spare the unnecessary details. My script is essentially doing this:

for row in reader:
    values = {...} # Build dict from row to pass on to request
    upload_record(values)

With hundreds of thousands of records, this is extremely slow. I'm currently sitting around 50k after 15 hours.

How can I speed this up? The order of the data doesn't matter; it just needs to get from the CSV to the API table as fast as possible.

  • If you try to upload an item twice, is that a problem? Does doing so generate an error? – Steven Rumbalski Oct 16 '18 at 15:17
  • @StevenRumbalski It won't generate an error, but we don't want multiple copies of the same data. – 23k Oct 16 '18 at 15:18
  • If you try to upload 1000 records at once and get an error, does the whole upload fail or does just the bad record fail? – Steven Rumbalski Oct 16 '18 at 15:19
  • @StevenRumbalski the whole upload will fail. – 23k Oct 16 '18 at 15:19
  • Why not try to upload 1000? If you get an error, try the first 500. If that errors, try the first 250. Basically, binary search down to the bad records (see the chunk-and-bisect sketch after these comments). – Steven Rumbalski Oct 16 '18 at 15:20
  • That doesn't help with speed though. – slider Oct 16 '18 at 15:22
  • @slider: If the errors are sparse, it would. The slowest part of a web API is network time. The goal would be to make most attempts count as multiple hits. Even if he went with attempting 10 at once and fell back to 1 at a time on any group of 10 that errored out, it should speed up. – Steven Rumbalski Oct 16 '18 at 15:24
  • @23k, does the server support parallel POST requests? – Rahul Chawla Oct 16 '18 at 15:24
  • "The requests need to be done one record at a time because of data validation reasons." Can you explain more? Even with data validation and potential of failure, you don't necessarily need to do one record at a time. – Louis Ng Oct 16 '18 at 15:26
  • @RahulChawla It's not my server, so I don't have all the specifics. I would guess, based on the number of people/requests they receive, that the answer is yes. – 23k Oct 16 '18 at 15:26
  • @StevenRumbalski's suggestion is quite useful: try uploading in chunks; it will be a lot faster than the method you are using. Plus, you can also try asynchronous and parallel requests. – Rahul Chawla Oct 16 '18 at 15:28
  • @RahulChawla If you could provide a link to an example of parallel requests that would be helpful, I'm not familiar with their implementation in Python :) – 23k Oct 16 '18 at 15:31
  • Here you go! https://stackoverflow.com/questions/43448042/parallel-post-requests-using-multiprocessing-and-requests-in-python Do read the question and answer carefully. (A thread-based sketch along these lines follows the comments.) – Rahul Chawla Oct 16 '18 at 15:34
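
A minimal sketch of the chunk-and-bisect idea from the comments above. It assumes the rows come from csv.DictReader and relies on a hypothetical upload_batch(records) helper that POSTs up to 1000 records in one request and raises a (likewise hypothetical) ValidationError when the API rejects the batch; substitute whatever your client actually exposes:

import csv

# upload_batch() and ValidationError are hypothetical stand-ins for whatever
# the real API client provides for batch POSTs and its rejection error.
CHUNK_SIZE = 1000  # the API's documented per-request limit

def upload_with_bisect(records):
    """Upload a batch; on failure, binary-search down to the bad record(s)."""
    if not records:
        return
    try:
        upload_batch(records)                    # one POST for the whole slice
    except ValidationError:
        if len(records) == 1:
            print("Skipping bad record:", records[0])  # log it and move on
            return
        mid = len(records) // 2
        upload_with_bisect(records[:mid])        # retry each half separately
        upload_with_bisect(records[mid:])

with open("dump.csv", newline="") as f:          # hypothetical path to the dump
    reader = csv.DictReader(f)
    chunk = []
    for row in reader:
        chunk.append(dict(row))                  # build the values dict from the row
        if len(chunk) == CHUNK_SIZE:
            upload_with_bisect(chunk)
            chunk = []
    upload_with_bisect(chunk)                    # whatever is left over

If bad records are sparse, most chunks go through in a single request, which is where the speedup comes from.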
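
And a rough sketch of parallelising the existing one-record-at-a-time uploads with a thread pool, in the spirit of the linked question (which uses multiprocessing). It assumes the upload_record(values) call from the question performs a single POST and can safely be called from several threads at once:

import csv
from concurrent.futures import ThreadPoolExecutor, as_completed

def load_rows(path):
    # Build the values dict from each CSV row, as in the original loop
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield dict(row)

def upload_all(path, workers=20):
    # workers=20 is a guess; tune it to whatever concurrency the API tolerates
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(upload_record, values) for values in load_rows(path)]
        for future in as_completed(futures):
            exc = future.exception()             # surface per-record failures
            if exc is not None:
                print("Upload failed:", exc)

upload_all("dump.csv")                           # hypothetical path to the dump

Threads help here because the bottleneck is network round-trips, not CPU, and the two ideas combine: submit chunks to the pool instead of single records.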

0 Answers