
I have a large CSV file, around 6 GB, containing 37,000,000 lines. I need to upload all of these lines using the sample request below:

curl --location --request POST 'http://localhost:7234/feedback/ingest/csv' \
--header 'charset: UTF-8' \
--form 'file=@"/home/new_file_1.csv"'

The destination API has a constraint of 2 MB per request (~12,000 lines). I also have a disk constraint (at most 400 MB of additional space), so I can't split the input into many small files. The only way I've found in Python is to iterate over the rows, and for each chunk (~10,000 lines) create and dump a temp file and fire a POST request, roughly as sketched below. Is there a better way?
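
What I have now looks roughly like this (a simplified sketch; the temp path and chunk size are just placeholder values):

import csv
import os
import requests

url = 'http://localhost:7234/feedback/ingest/csv'
CHUNK = 10000  # rows per temp file; rough example value

with open('/home/new_file_1.csv', newline='') as src:
    reader = csv.reader(src)
    rows = []
    for row in reader:
        rows.append(row)
        if len(rows) >= CHUNK:
            # dump the chunk to a temp file, POST it, then delete it
            with open('/tmp/chunk.csv', 'w', newline='') as tmp:
                csv.writer(tmp).writerows(rows)
            with open('/tmp/chunk.csv', 'rb') as tmp:
                requests.post(url, files={'file': tmp})
            os.remove('/tmp/chunk.csv')
            rows = []
    # (any leftover rows after the loop need the same treatment)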

Dev Dev
  • Do you have the ability/access to change the API? If not then you won't be able to upload such a large file without the API provider removing the limit or providing a separate mechanism for uploading files – Iain Shelvington Mar 05 '22 at 02:04
  • No, I can't change the API, though I can hit it with parallel requests. Is there any way to avoid creating new files and directly use in-memory chunks of lines read from the file? – Dev Dev Mar 05 '22 at 02:07
  • Does this answer your question? [Python: HTTP Post a large file with streaming](https://stackoverflow.com/questions/2502596/python-http-post-a-large-file-with-streaming) – Gino Mempin Mar 05 '22 at 02:18
  • Can you clarify what you want to do and what the constraints are? Are you trying to send a single request like you show, with many smaller parts of a single multipart submission, or are you looking to send multiple requests? To what exactly does the 2 MB constraint apply? – CryptoFool Mar 05 '22 at 02:26
  • A single request of 2 MB is the constraint; parallel requests of a similar size are fine. Also, disk space is a constraint, so I can't create a lot of small files – Dev Dev Mar 05 '22 at 02:33
  • Why not just open and read the file in your Python code, buffering lines until you've got the number/size you want, sending a Requests request with that buffer, and then clearing the buffer and accumulating again? A 2MB chunk in memory is nothing. No reason to write it to disk. Requests will happily take an in-memory payload. Even if it wouldn't, you could wrap a file-like object around your buffer and treat it like a file stream. – CryptoFool Mar 05 '22 at 02:41
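
A minimal sketch of that buffer-as-file idea, assuming the same endpoint and `file` form field from the question; the buffer content here is just a placeholder, and `io.StringIO` wraps the accumulated lines so Requests can treat them like a file:

import io
import requests

url = 'http://localhost:7234/feedback/ingest/csv'

# pretend this string holds ~2 MB of accumulated CSV lines
buffer = 'col1,col2\n1,2\n3,4\n'

# wrap the in-memory buffer in a file-like object; no temp file needed
file_like = io.StringIO(buffer)
files = {'file': ('chunk.csv', file_like, 'text/csv')}
r = requests.post(url, files=files)
print(r.status_code)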

1 Answer


You can read the large file as a stream: read it line by line until you have accumulated 12,000 lines, then send that chunk via an HTTP request.

import requests

url = 'http://localhost:7234/feedback/ingest/csv'

def read_file(file_name, num_lines):
    # Yield the file contents in chunks of at most num_lines lines.
    with open(file_name) as file:
        count = 0
        lines = ''
        for line in file:
            count += 1
            lines += line

            if count >= num_lines:
                yield lines
                lines = ''
                count = 0
        # yield whatever is left over after the last full chunk
        if lines:
            yield lines

for chunk in read_file("/home/new_file_1.csv", 12000):

    # prepare the multipart payload; the in-memory chunk takes the place of a file on disk
    files = {'file': ('chunk.csv', chunk, 'text/csv', {'Expires': '0'})}

    # send it to the api
    r = requests.post(url, files=files)
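
Since the comments mention that parallel requests of a similar size are acceptable, the chunks could also be posted concurrently. A rough sketch using `concurrent.futures` (the worker count is an arbitrary example), reusing `url` and `read_file` from above:

from concurrent.futures import ThreadPoolExecutor

def post_chunk(chunk):
    # each chunk is its own multipart upload, staying under the 2 MB limit
    files = {'file': ('chunk.csv', chunk, 'text/csv')}
    return requests.post(url, files=files)

with ThreadPoolExecutor(max_workers=4) as pool:
    for r in pool.map(post_chunk, read_file("/home/new_file_1.csv", 12000)):
        r.raise_for_status()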

I hope this helps

  • Ohh, I misunderstood the question; the API only accepts CSV. Maybe you can create a temporary file after accumulating 12,000 lines, send it, and immediately delete it after a successful send to the API – Christopher Enriquez Mar 05 '22 at 02:23
  • Don't you need to reset `count` and `lines` after you make each request? As it is now, it seems that you'll keep adding one more line to the payload and will then send the whole mess again. – CryptoFool Mar 05 '22 at 02:39
  • @CryptoFool that's correct, I already updated the answer to resolve that issue – Christopher Enriquez Mar 05 '22 at 02:41
  • Isn't this going to break the file at arbitrary places, not on line boundaries? That's not going to end well. Also, you are not decoding the CSV (and you don't want to), so why are you writing it back out using a CSV writer? It's already CSV. No need to write it out to a file either. Requests will take an in-memory buffer as a payload or part, same as a file. Your first version was better than this one. Simpler and better. – CryptoFool Mar 05 '22 at 02:45
  • @CryptoFool The intention is to create 2 MB worth of CSV file because the API has that limitation. – Christopher Enriquez Mar 05 '22 at 02:49
  • Yeah, but I expect that whatever is processing the data on the other end will assume that it is getting complete lines in each request. I can't imagine why you would think you could assume that the processor is going to realize that the last part of the request is an incomplete line, hold onto it until the next request comes in, read the partial line from the second request, splice the parts together, process that line, and then move on. There's no reason you can't break at line boundaries, and you should. Your first version was reading lines. That's the better idea. – CryptoFool Mar 05 '22 at 02:51
  • @CryptoFool Ooh yeah, I get what you mean. Sorry for the half-baked answer – Christopher Enriquez Mar 05 '22 at 02:52
  • @CryptoFool how about the updated one? Will that suffice? – Christopher Enriquez Mar 05 '22 at 03:09