I have a couple thousand CSV files (about 2 GB in total). Loading them into memory via
imp.append(pd.read_csv('.\\folder\\' + iterator, delimiter=',', header=None, engine='c', dtype=np.float32))
takes about 60 seconds.
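For context, the full loading loop looks roughly like this (how I build the files list is incidental, os.listdir is just for illustration):
import os
import numpy as np
import pandas as pd

files = os.listdir('.\\folder')   # names of the CSV files
imp = []
for iterator in files:
    # every file is a plain float32 table with no header row
    imp.append(pd.read_csv('.\\folder\\' + iterator, delimiter=',', header=None,
                           engine='c', dtype=np.float32))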
In order to speed up the import process, I tried to pickle the data into a single file on the hard drive via
pickle.dump(imp, file, 2)
However, contrary to my expectations, unpickling the data to import it again via
imp = pickle.load(file)
takes at least 60 seconds as well, or even more.
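For reference, the full dump/load round trip I am timing looks roughly like this (the file name and the way the file object is opened are my own filling-in, shown for completeness):
import pickle

# write the whole list of DataFrames as one binary dump, protocol 2
with open('.\\file.dump', 'wb') as file:
    pickle.dump(imp, file, 2)

# ...and, in a later session, read it back
with open('.\\file.dump', 'rb') as file:
    imp = pickle.load(file)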
Is this expected behavior? I thought that after saving a binary dump of the data to the hard drive, we should be able to retrieve it in a fraction of the time a CSV parser needs to load it initially. Perhaps pickling is not the right tool for this? What should I do to dump pre-processed data in a binary format and load it back into memory more efficiently?
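(The timings above are measured naively, roughly like this; the measurement code itself is not the point:)
import pickle
import time

start = time.perf_counter()
with open('.\\file.dump', 'rb') as file:
    imp = pickle.load(file)
print(time.perf_counter() - start)   # ~60 seconds or more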
EDIT:
Per a suggestion by AMC in the comments, I saved the data as:
import numpy as np
import pandas as pd

imp = []
for iterator in files:
    # read each CSV as float32 and collect it as a nested list
    imp.append(pd.read_csv('.\\folder\\' + iterator, delimiter=',', header=None,
                           engine='c', dtype=np.float32).values.tolist())
# build one 3-D float32 array out of all files and dump it to disk
imp = np.asarray(imp, dtype=np.float32)
np.savez(".\\file.dump.npz", imp)
and load it as
imp = np.load(".\\file.dump.npz")['arr_0']
which turns out to be very fast.
print(imp.shape)
(24347, 1140, 10)
print(imp.dtype)
float32
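As a side note, the .values.tolist() / np.asarray round trip is probably not needed; since every file apparently has the same (1140, 10) shape, stacking the per-file arrays directly should give the same result (a minimal sketch, assuming the same files list as above):
import os
import numpy as np
import pandas as pd

files = os.listdir('.\\folder')
arrays = [pd.read_csv('.\\folder\\' + iterator, delimiter=',', header=None,
                      engine='c', dtype=np.float32).to_numpy()
          for iterator in files]
imp = np.stack(arrays)                    # shape (n_files, 1140, 10)
np.savez(".\\file.dump.npz", imp=imp)     # store under an explicit key

imp = np.load(".\\file.dump.npz")['imp']  # load it back by that key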