
I have a lot of CSV files made of 3 columns, like this:

Facsimile of the file names: file_1, file_4, file_5, file_7, etc.
(They all share the same base name and differ only in the final number. The numbers are not consecutive, as in the example.)


The contents look like this:

['357', '29384', '0.0031545741324921135']
['357', '29389', '0.0031545741324921135']
['357', '29526', '0.0368574903844921735']
['357', '35516', '0.0036775741324564665']
['357', '35551', '0.0023554341325646453']
['357', '35639', '0.0064467781324766535']
['357', '36238', '0.0067543874132467543']
['357', '37162', '0.0031545746577921135']

Let's name the 3 columns [a, b, c]. I'd like to sort them by c, the last column. I have to read all the files and sort all the content into one huge file. I could use a pickle, for example.

My first idea was:

import csv
from operator import itemgetter
fn = 1
# N as the max number in the really last file
while fn < N:
    newfile = open("file_{fn}.csv", "r")
    reader = csv.reader(newfile)

    file = open("BigSortedFile.csv", "w")

    for line in sorted(reader, key=itemgetter(2)):
        file.write(line)

    fn = fn + 1
file.close()

# after the loop I think I have to sort the BigSortedFile again.

But it's not working, because file.write needs a string and each line from the reader is a list, not a string. How can I do the whole process?

HugoB
1 Answer


To sort all lines you need to read them all into one data structure, then write them out again.

The csv module needs you to open files with newline="" to work properly. When you use a csv.reader to read, you can also use a csv.writer to write your data:

import csv

fn = 1  # first file has number 1 in its filename
N = 42  # last number in the filenames is 42

data = []
while fn <= N:
    # note the f-string, so the counter actually ends up in the filename
    with open(f"file_{fn}.csv", "r", newline="") as newfile:
        reader = csv.reader(newfile)
        data.extend(reader)
    fn += 1  # without this the loop would never advance

# convert column c to float so the sort is numeric, not lexicographic
data.sort(key=lambda row: float(row[2]))

with open("BigSortedFile.csv", "w", newline="") as bf:
    writer = csv.writer(bf)
    writer.writerows(data)
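
Note that the question says the file numbers are not consecutive, so counting from 1 to N will raise a FileNotFoundError for every missing number. Here is a minimal sketch of a variant that simply picks up whatever matching files exist, using the standard-library glob module (the file_*.csv pattern is an assumption based on the names in the question):

import csv
import glob

data = []
# collect every file matching the pattern, whatever numbers actually exist
for path in glob.glob("file_*.csv"):
    with open(path, "r", newline="") as f:
        data.extend(csv.reader(f))

# numeric sort on column c
data.sort(key=lambda row: float(row[2]))

with open("BigSortedFile.csv", "w", newline="") as bf:
    csv.writer(bf).writerows(data)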
Patrick Artner
  • Ok thanks. Now I'm trying to see if this works, even if it's taking time. I also have some GB of data, I really dunno if this will work for so much stuff – HugoB Dec 05 '20 at 21:37
  • @Hugo you should have mentioned that - I highly doubt it will work - GB sounds as if it won't fit into memory. You would need to maybe partially sort stuff and you _definitely_ should look into pandas or something like it to wrangle that much data (see the sketch below). – Patrick Artner Dec 05 '20 at 21:42
  • @HugoB: [how-do-i-read-a-large-csv-file-with-pandas](https://stackoverflow.com/questions/25962114/how-do-i-read-a-large-csv-file-with-pandas) and [python-pandas-merge-multiple-csv-files](https://stackoverflow.com/questions/48051100/python-pandas-merge-multiple-csv-files) – Patrick Artner Dec 05 '20 at 21:51
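
For the GB-sized case discussed in the comments above, here is a minimal sketch of the "partially sort" idea: an external merge sort that sorts fixed-size chunks in memory, spills each sorted chunk to a temporary file, and then streams a k-way merge with the standard-library heapq.merge. CHUNK_ROWS is an assumed tuning knob, not anything from the question; adjust it to your RAM.

import csv
import glob
import heapq
import os
import tempfile

CHUNK_ROWS = 1_000_000  # assumption: rows per in-memory chunk, tune to available RAM

def sort_key(row):
    return float(row[2])  # numeric sort on column c

chunk_paths = []

def spill(rows):
    # sort one chunk in memory and write it to a temporary CSV file
    rows.sort(key=sort_key)
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    chunk_paths.append(path)

# pass 1: split the input into sorted chunk files
rows = []
for path in glob.glob("file_*.csv"):
    with open(path, "r", newline="") as f:
        for row in csv.reader(f):
            rows.append(row)
            if len(rows) >= CHUNK_ROWS:
                spill(rows)
                rows = []
if rows:
    spill(rows)

def read_sorted(path):
    with open(path, "r", newline="") as f:
        yield from csv.reader(f)

# pass 2: lazy k-way merge; only one row per chunk is held in memory at a time
with open("BigSortedFile.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for row in heapq.merge(*(read_sorted(p) for p in chunk_paths), key=sort_key):
        writer.writerow(row)

for p in chunk_paths:
    os.remove(p)

pandas with chunksize (see the linked questions) is another route; the sketch above just stays within the standard library.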