
I am trying to read a .csv file I created earlier in Python using

import csv

# csvname and data are defined elsewhere; data is the 30k * 30k float32 matrix
with open(csvname, 'w') as csvfile:
    csvwriter = csv.writer(csvfile, delimiter=',')
    csvwriter.writerows(data)

data is a random matrix with about 30k * 30k entries in np.float32 format, about 10 GB of file size in total.

When I read the file back in using this function (I already know the size of my matrix, and np.genfromtxt is incredibly slow and would need about 100 GB of RAM at this point)

import time
import numpy as np

def read_large_txt(path, delimiter=',', dtype=np.float32, nrows=0):
    t1 = time.time()
    with open(path, 'r') as f:
        out = np.empty((nrows, nrows), dtype=dtype)
        for ii, line in enumerate(f):
            # every second line is empty, so only parse the even ones
            if ii % 2 == 0:
                out[ii // 2] = line.split(delimiter)
    print('Reading %s took %.3f s' % (path, time.time() - t1))
    return out

it takes me about 10 minutes to read that file. The hard drive I am using should be able to read about 100 MB/s, which would bring the reading time down to about 1-2 minutes.
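(To check whether the drive itself is the limit, a rough sketch like the following, reusing csvname from above, should show the raw read speed without any parsing; the OS cache can inflate the number if the file was written just before:)

import time

# read the raw file in 16 MB binary chunks with no parsing at all,
# to see what the disk (or the OS cache) actually delivers
t0 = time.time()
nbytes = 0
with open(csvname, 'rb') as f:
    while True:
        chunk = f.read(16 * 1024 * 1024)
        if not chunk:
            break
        nbytes += len(chunk)
print('Raw read: %.0f MB in %.1f s' % (nbytes / 1e6, time.time() - t0))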

Any ideas what I may be doing wrong?

Related: why numpy narray read from file consumes so much memory? That's where the function read_large_txt is from.

  • Maybe I should add that I am using if ii%2 == 0: because otherwise I'd try to pass empty lines to the output matrix – Forrest Thumb Apr 17 '18 at 09:31
  • Do you have enough RAM? – Seer.The Apr 17 '18 at 09:34
  • Extract the initialization out of the reading time to be sure it is related to the file size. If nrows is big, it may use swap – Benjamin Apr 17 '18 at 09:34
  • Yep, I've got 120 GB of RAM. It does read the entire file; I was just wondering if there is a way to do that faster. – Forrest Thumb Apr 17 '18 at 09:35
  • Maybe split has a bad implementation like in Java (causing a lot of string allocation), try the csv reader – Benjamin Apr 17 '18 at 09:36
  • @Benjamin apparently it takes about 15 ms to split one line and write it to the variable 'out'. I don't think I'll get the single-line splitting any faster, but I will try some multithreading to process several lines simultaneously. – Forrest Thumb Apr 17 '18 at 11:17

1 Answer


I found a quite simple solution. Since I am creating the files myself, I don't need to save them as .csv files. It is way (!) faster to load them as .npy files:

Loading (incl. splitting each line by ',') a 30k * 30k matrix stored as .csv takes about 10 minutes. Doing the same with a matrix stored as .npy takes about 10 seconds!

That's why I changed the writing code above to:

np.save(npyname, data)

and in the other script to

out = np.load(npyname + '.npy')
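Putting the two pieces together, a minimal round-trip sketch (just a sketch; npyname is whatever base name I pick, and data is the float32 matrix from the question):

import time
import numpy as np

# writing script: store the matrix in NumPy's binary .npy format
np.save(npyname, data)  # writes npyname + '.npy'

# reading script: load it back and time the call
t1 = time.time()
out = np.load(npyname + '.npy')
print('Loading took %.3f s, shape %s, dtype %s' % (time.time() - t1, out.shape, out.dtype))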

Another advantage of this method: in my case the .npy files are only about 40% of the size of the .csv files. :)

  • I'd advise you also look up HDF5 if you have larger data. This has native compression and, on top, is accessible across languages. – jpp Apr 17 '18 at 12:07 (see the h5py sketch after these comments)
  • @jpp Does HDF5 increase the loading speed further? – Forrest Thumb Apr 17 '18 at 12:09
  • For *large* data (say 1GB+), if you choose the correct options, in my experience yes. – jpp Apr 17 '18 at 12:15
  • Might be worth a try then. My files have 4GB+ with npy and 10 GB+ with csv. – Forrest Thumb Apr 17 '18 at 12:18
  • @jpp I checked it out. To be honest, I was not able to increase the loading speed; probably I am already at the machine limit (4 GB in 4-6 s is already about 1 GB/s). Using gzip compression (compression_opts=9) I could decrease the file size a little (3.7 GB instead of 4 GB), but my data are quite random, so that might work out better for other data structures. With that compression, though, writing to the hard drive takes much longer. For now I will probably keep using .npy files. – Forrest Thumb Apr 18 '18 at 06:33
  • Gzip is slow, try lzf. But you are right, this may sacrifice write speed for read speed. Usually it's a balance you have to decide on. – jpp Apr 18 '18 at 07:09
  • If speed is an issue I would use BLOSC, e.g. https://stackoverflow.com/a/48997927/4045774 BTW: 1 GB/s seems a bit slow for an M.2 SSD, but way too fast for all other SSDs and especially HDDs. Saving and directly loading is not the same as saving, doing something else and then reloading the data. If the data is already somewhere in the RAM, that skews the whole reading measurement or parts of it. – max9111 Apr 18 '18 at 14:21
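For reference, a minimal h5py sketch along the lines of the HDF5/lzf suggestion in the comments above (file and dataset names are placeholders, and data is assumed to be the same 30k * 30k float32 array):

import time
import h5py

# write: one dataset with lzf compression (fast, modest compression ratio)
with h5py.File('matrix.h5', 'w') as f:
    f.create_dataset('data', data=data, compression='lzf')

# read: open the file and pull the full dataset back into memory
t1 = time.time()
with h5py.File('matrix.h5', 'r') as f:
    out = f['data'][:]  # [:] materialises the whole dataset as a NumPy array
print('HDF5 load took %.3f s' % (time.time() - t1))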