
I have a couple thousand CSV files (about 2 GB in total). Loading them into memory via

imp.append(pd.read_csv('.\\folder\\' + iterator, delimiter=',', header=None, engine='c', dtype=np.float32))

takes about 60 seconds. In order to speed up the import process, I tried to pickle the data into a single file on the hard drive via

pickle.dump(imp, file, 2)
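
For completeness, the full dump step with the file opened explicitly looks roughly like this (the file name is illustrative):

with open('.\\imp.pickle', 'wb') as file:
    pickle.dump(imp, file, 2)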

However, contrary to my expectations, unpickling the data to import it again

imp = pickle.load(file)

also takes at least 60 seconds, if not more.

Is this expected behavior? I thought that after saving a binary dump of the data to the hard drive, we would be able to retrieve it in a fraction of the time a CSV parser needs to load it initially. Perhaps pickling is not the right procedure for this? What should I do to dump pre-processed data in binary form and load it back into memory more efficiently?

EDIT:

Per AMC's suggestion in the comments, I saved the data as:

imp = []
for iterator in files:
    imp.append(pd.read_csv('.\\folder\\' + iterator, delimiter=',', header=None,
                           engine='c', dtype=np.float32).values.tolist())
imp = np.asarray(imp, dtype=np.float32)
np.savez(".\\file.dump.npz", imp)

and load it as

imp = np.load(".\\file.dump.npz")['arr_0']

which turns out to be very fast (about 1.5 seconds).

print(imp.shape)
(24347, 1140, 10)

print(imp.dtype)
float32
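
As AMC points out in the comments below, since the result is a single array, numpy.save() would be a slightly more natural fit than numpy.savez() and avoids the 'arr_0' lookup; a minimal variant (file name illustrative) might be:

np.save(".\\file.dump.npy", imp)
imp = np.load(".\\file.dump.npy")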

  • I think you are both overestimating the amount of work parsing a CSV file takes, and underestimating the amount of work parsing a pickle file takes. The disk read is the time-consuming operation here. – chepner Feb 03 '20 at 22:08
  • @chepner as a comparison, in *Mathematica* loading the CSVs initially takes much longer, but after creating a binary dump of the data in *Mathematica*'s `.mx` format, loading it back into memory takes only about 10 seconds. I was hoping Python would have similar capabilities... – Kagaratsch Feb 03 '20 at 22:12
  • Did you name a variable _import_? Can you provide some more context for this? What kind of data is it? The _dtype=np.float32_ makes me think that a Pandas DataFrame may not be the appropriate data structure. – AMC Feb 03 '20 at 22:14
  • @AMC the data contains about 12kb worth of floating point numbers per CSV. Any recommendation on how to set up a quicker loading routine for this would be welcome. – Kagaratsch Feb 03 '20 at 22:17
  • @Kagaratsch _the data contains about 12kb worth of floating point numbers per CSV._ Where does it come from, what is it being used for? Are these arrays or something? – AMC Feb 03 '20 at 22:20
  • @AMC data comes from experiment. Each file contains 1140 rows and 10 columns of floating point numbers. The floats contain only about 5-6 relevant digits each, as far as their use is concerned. – Kagaratsch Feb 03 '20 at 22:23
  • @Kagaratsch How is it being used? I'm trying to determine if it would be better to use NumPy arrays. – AMC Feb 03 '20 at 22:25
  • @AMC Oh, yes, in fact I convert them to NumPy arrays after importing, and eventually the numbers are fed to tensorflow as features to train a NN. – Kagaratsch Feb 03 '20 at 22:26
  • wait, is `imp` a list? – juanpa.arrivillaga Feb 03 '20 at 22:26
  • @juanpa.arrivillaga yes, `imp=[]` and then `for iterator in files:` as usual. – Kagaratsch Feb 03 '20 at 22:27
  • I believe @juanpa.arrivillaga is getting to what my next question was going to be, which is: Does it not make sense to use a single multidimensional array? – AMC Feb 03 '20 at 22:27
  • @AMC that sounds interesting! How would that look like in practice? – Kagaratsch Feb 03 '20 at 22:28
  • If you are sticking to using multiple distinct arrays, then switch to [`numpy.savez()`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.savez.html) and [`numpy.load()`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.load.html#numpy.load) for serialization. – AMC Feb 03 '20 at 22:29
  • @Kagaratsch That depends on what you mean by _look like in practice_ :p – AMC Feb 03 '20 at 22:29
  • @Kagaratsch something like `big_array = np.concatenate([df.values for df in imp])`, then just use `np.save('myarray.npz', big_array)` should be about as fast as possible – juanpa.arrivillaga Feb 03 '20 at 22:30
  • @juanpa.arrivillaga and AMC thanks for the suggestions! I'll give it a try. – Kagaratsch Feb 03 '20 at 22:31
  • @juanpa.arrivillaga is right, assuming that the data first appears in DataFrames, of course. – AMC Feb 03 '20 at 22:31
  • You could probably ditch `pandas` altogether, I don't know why people seem to use it as a csv-reader. – juanpa.arrivillaga Feb 03 '20 at 22:32
  • @Kagaratsch _How would that look like in practice?_ I'll try to explain the general process: If you have, say 150 arrays of shape (1140, 10), you would combine them into a single array of shape (150, 1140, 10). It's exactly the same sort of process/transformation as grouping a bunch of 1-dimensional arrays to form the rows of a 2-dimensional array, which is a bit easier to visualize. – AMC Feb 03 '20 at 22:35
  • @AMC I believe `imp` ends up being of that type precisely. – Kagaratsch Feb 03 '20 at 22:36
  • @juanpa.arrivillaga _You could probably ditch pandas altogether, I don't know why people seem to use it as a csv-reader._ Or as a NumPy substitute, for that matter. – AMC Feb 03 '20 at 22:37
  • @Kagaratsch _I believe imp ends up being of that type precisely._ When? – AMC Feb 03 '20 at 22:37
  • @AMC after all the `append` operations are done. Since each append writes a single 1140 x 10 dimensional entry, after n such appends we have an n x 1140 x 10 dimensional list, no? – Kagaratsch Feb 03 '20 at 22:38
  • @Kagaratsch no, `list` objects *don't have dimensions*. They do not have `dtypes`. `list` objects are heterogeneous sequences of python objects. multidimensional arrays are something else altogether. – juanpa.arrivillaga Feb 03 '20 at 22:40
  • @Kagaratsch Where does the data come from in the first place? Is it created by your program? – AMC Feb 03 '20 at 22:40
  • @juanpa.arrivillaga oh, I see. AMC, Data is measured by a physical apparatus and stored in csv format. – Kagaratsch Feb 03 '20 at 22:43
  • @Kagaratsch In that case, you might be able to use [`numpy.loadtxt()`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html) to read each CSV into an array, and then concatenate them. – AMC Feb 03 '20 at 22:49
  • @AMC thanks for the suggestion, I'll give that a try, too! – Kagaratsch Feb 03 '20 at 22:50
  • @AMC wow, after converting `imp` to a n x 1140 x 10 `numpy.asarray` and using `numpy.savez` as you suggested, the data loads in just `1.5` seconds via `imp=(np.load(mypath))['arr_0']`! If you post this as an answer I'll upvote and accept for sure! – Kagaratsch Feb 03 '20 at 23:17
  • @Kagaratsch Be careful, `numpy.savez()` is for saving multiple arrays into a single file, you need `numpy.save()`. _numpy.asarray_ That’s a function. I think you meant ndarray, the type? What does the code you used look like? – AMC Feb 03 '20 at 23:35
  • @AMC see edit of the question. – Kagaratsch Feb 03 '20 at 23:47
  • What do the contents of `imp` look like? – AMC Feb 03 '20 at 23:48
  • @AMC `imp.shape` is `(24347, 1140, 10)`. – Kagaratsch Feb 03 '20 at 23:52
  • What’s the `dtype`? Can you share the first few elements? – AMC Feb 03 '20 at 23:53
  • @AMC `imp.dtype` is `float32`. – Kagaratsch Feb 04 '20 at 00:00
  • @Kagaratsch Great, sounds like it's working! You should try reading the CSV files with numpy, though, it will be much simpler. Don't forget to use `numpy.save()`, not `numpy.savez()`, too! – AMC Feb 04 '20 at 00:12
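
Putting together the `numpy.loadtxt()` and single-array suggestions from the comments above, a pandas-free sketch of the import might look like this (paths and variable names are illustrative):

import os
import numpy as np

folder = '.\\folder'
files = sorted(os.listdir(folder))

# Read each CSV straight into a (1140, 10) float32 array and stack them
# into one (n, 1140, 10) array, which can then be saved with np.save().
imp = np.stack([np.loadtxt(os.path.join(folder, f), delimiter=',', dtype=np.float32)
                for f in files])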
