
I have a huge csv file (8 GB or much more, with millions of lines). The first field is a text field (no quoting); the second one is a date in the format mm/dd/yyyy. The other fields may vary. No header, UTF-8 encoding. Here is an example:

Lorem ipsum dolor sit amet,10/30/2020,2340.234450,pet,999
consectetur adipiscing elit,10/30/2020,54.2,home,577

I need to sort the file by date as efficiently (as quickly) as possible, using Python, without loading the whole file into memory at once. The problem is I have little memory (about 4 GB of RAM). Older dates should go first.

I found some solutions (e.g. this and this) directly using the OS commands, but none specific to Python and date fields. Also, I cannot use databases. Could you help me?
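To be clear about the date field: mm/dd/yyyy strings don't sort chronologically as plain text (for example "01/01/2021" compares as less than "10/30/2020"), so any sort needs a key that parses the date or reorders its parts. A small sketch (function names are just illustrative):

```python
from datetime import datetime

def date_key(field: str) -> datetime:
    """Parse an mm/dd/yyyy field into a datetime, so comparisons are chronological."""
    return datetime.strptime(field, "%m/%d/%Y")

def date_key_str(field: str) -> str:
    """Equivalent string trick: '10/30/2020' -> '20201030', which sorts lexicographically."""
    m, d, y = field.split("/")
    return y + m + d
```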

Forinstance
  • You can't sort something without loading it into memory. The pages you refer to still load it into memory, just into *someone else's* memory, e.g. the DOS/Unix SORT command's. If you want to sort it, the entire contents have to be available in memory; otherwise the sort may be inaccurate. – Benjamin Schollnick Oct 30 '20 at 15:36
  • I thought I could, by dividing it into chunks or similar. Similar to this question: https://stackoverflow.com/questions/7361074/how-can-i-sort-large-csv-file-without-loading-to-memory. I've now edited the question to be more specific. Couldn't I repeatedly load part of the file into memory rather than all of it, if the whole thing is too much? – Forinstance Oct 30 '20 at 15:39
  • It would probably be better to use something like pandas to read the file, sort it by date, and then save it again so it is at least in order. You are asking for a lot and expecting to do little. If memory is still an issue you could split the data up, but I don't think you are going to get the answer you are looking for. – sntrenter Oct 30 '20 at 15:42

1 Answer


You can try some "hack" solutions, like uploading your file to a machine with more RAM (Colab, for example, though the upload will take some time).

You could also increase your swap space, so the file can still be loaded even when it no longer fits in physical RAM (at a heavy speed cost).

Or you could slim the problem down: read the file in chunks, extract only the date column plus an ID (the line number) for each row, build a new dataframe with just ID and date, sort that normally, and you will have the row indexes in sorted order.
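That ID-and-date idea can be sketched with pandas chunked reading; the function name and chunk size below are my own illustrative choices:

```python
import pandas as pd

def sorted_line_order(path, chunksize=1_000_000):
    """Stream only the date column (second field) in chunks, parse it,
    and return the original line numbers in oldest-first order."""
    parts = []
    for chunk in pd.read_csv(path, header=None, usecols=[1], chunksize=chunksize):
        # with no header, the column label is its position, 1
        parts.append(pd.to_datetime(chunk[1], format="%m/%d/%Y"))
    dates = pd.concat(parts)
    # mergesort is stable: rows with equal dates keep their file order
    return dates.sort_values(kind="mergesort").index.tolist()
```

The returned list gives the line numbers in sorted order; a second pass over the file can then emit the lines in that order.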

However, you can't simply sort your dataframe chunk by chunk, because depending on how you recombine the chunks the result may be inaccurate. If you have this: 3 9 1 2 5 6 4 0 8 7 and sort it in two chunks, you get 1 2 3 5 9 || 0 4 6 7 8. How do you combine them without having to reorder everything?
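For what it's worth, one standard way to combine sorted chunks without re-sorting everything is a k-way merge: write each sorted chunk to a temporary "run" file, then repeatedly take the smallest head row across all runs, which `heapq.merge` does as a stream. A sketch under the question's format; the function name, chunk size, and temp-file handling are illustrative:

```python
import csv
import heapq
import os
import tempfile
from datetime import datetime
from itertools import islice

def external_sort_csv(src, dst, chunk_lines=1_000_000):
    """External merge sort: sort each chunk in memory, spill it to a temp
    file as a sorted run, then k-way merge all runs with heapq.merge,
    which only holds one row per run in memory at a time."""
    key = lambda row: datetime.strptime(row[1], "%m/%d/%Y")
    runs = []
    with open(src, newline="", encoding="utf8") as f:
        reader = csv.reader(f)
        while True:
            chunk = list(islice(reader, chunk_lines))
            if not chunk:
                break
            chunk.sort(key=key)
            tmp = tempfile.NamedTemporaryFile(
                "w", newline="", encoding="utf8", suffix=".csv", delete=False)
            csv.writer(tmp).writerows(chunk)
            tmp.close()
            runs.append(tmp.name)

    def rows(path):
        with open(path, newline="", encoding="utf8") as f:
            yield from csv.reader(f)

    with open(dst, "w", newline="", encoding="utf8") as out:
        csv.writer(out).writerows(heapq.merge(*map(rows, runs), key=key))
    for p in runs:
        os.remove(p)
```

Peak memory is roughly one chunk during the sorting pass and one row per run during the merge, so `chunk_lines` is the knob to fit your 4 GB.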