
I have a couple thousand CSV files (about 2 GB in total). Loading them into memory via

imp.append(pd.read_csv('.\\folder\\' + iterator, delimiter=',', header=None, engine='c', dtype=np.float32))

takes about 60 seconds. In order to speed up the import process, I tried to pickle the data into a single file on the hard drive via

pickle.dump(imp, file, 2)
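
For completeness, the full dump step with the file opened explicitly looks roughly like this (the file name is illustrative):

with open('.\\imp.pickle', 'wb') as file:
    pickle.dump(imp, file, 2)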

However, contrary to my expectations, unpickling the data to import it again

imp = pickle.load(file)

also takes at least 60 seconds, if not more.

Is this expected behavior? I thought that after saving a binary dump of the data to the hard drive, we would be able to retrieve it in a fraction of the time a CSV parser needs to load it initially. Perhaps pickling is not the right procedure for this? What should I do to dump pre-processed data in binary form and load it back into memory more efficiently?

EDIT:

Per AMC's suggestion in the comments, I saved the data as:

imp = []
for iterator in files:
    imp.append(pd.read_csv('.\\folder\\' + iterator, delimiter=',', header=None,
                           engine='c', dtype=np.float32).values.tolist())
imp = np.asarray(imp, dtype=np.float32)
np.savez(".\\file.dump.npz", imp)

and load it as

imp = np.load(".\\file.dump.npz")['arr_0']

which turns out to be very fast (about 1.5 seconds).

print(imp.shape)
(24347, 1140, 10)

print(imp.dtype)
float32
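
As AMC points out in the comments below, since the result is a single array, numpy.save() would be a slightly more natural fit than numpy.savez() and avoids the 'arr_0' lookup; a minimal variant (file name illustrative) might be:

np.save(".\\file.dump.npy", imp)
imp = np.load(".\\file.dump.npy")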

  • I think you are both overestimating the amount of work parsing a CSV file takes, and underestimating the amount of work parsing a pickle file takes. The disk read is the time-consuming operation here. – chepner Feb 03 '20 at 22:08
  • @chepner as a comparison, in *Mathematica* loading the CSVs initially takes much longer, but after creating a binary dump of the data in *Mathematica*'s `.mx` format, loading it back into memory takes only about 10 seconds. I was hoping Python would have similar capabilities... – Kagaratsch Feb 03 '20 at 22:12
  • Did you name a variable _import_? Can you provide some more context for this? What kind of data is it? The _dtype=np.float32_ makes me think that a Pandas DataFrame may not be the appropriate data structure. – AMC Feb 03 '20 at 22:14
  • @AMC the data contains about 12kb worth of floating point numbers per CSV. Any recommendation on how to set up a quicker loading routine for this would be welcome. – Kagaratsch Feb 03 '20 at 22:17
  • @Kagaratsch _the data contains about 12kb worth of floating point numbers per CSV._ Where does it come from, what is it being used for? Are these arrays or something? – AMC Feb 03 '20 at 22:20
  • @AMC data comes from experiment. Each file contains 1140 rows and 10 columns of floating point numbers. The floats contain only about 5-6 relevant digits each, as far as their use is concerned. – Kagaratsch Feb 03 '20 at 22:23
  • @Kagaratsch How is it being used? I'm trying to determine if it would be better to use NumPy arrays. – AMC Feb 03 '20 at 22:25
  • @AMC Oh, yes, in fact I convert them to NumPy arrays after importing, and eventually the numbers are fed to tensorflow as features to train a NN. – Kagaratsch Feb 03 '20 at 22:26
  • wait, is `imp` a list? – juanpa.arrivillaga Feb 03 '20 at 22:26
  • @juanpa.arrivillaga yes, `imp=[]` and then `for iterator in files:` as usual. – Kagaratsch Feb 03 '20 at 22:27
  • I believe @juanpa.arrivillaga is getting to what my next question was going to be, which is: Does it not make sense to use a single multidimensional array? – AMC Feb 03 '20 at 22:27
  • @AMC that sounds interesting! How would that look like in practice? – Kagaratsch Feb 03 '20 at 22:28
  • If you are sticking to using multiple distinct arrays, then switch to [`numpy.savez()`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.savez.html) and [`numpy.load()`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.load.html#numpy.load) for serialization. – AMC Feb 03 '20 at 22:29
  • @Kagaratsch That depends on what you mean by _look like in practice_ :p – AMC Feb 03 '20 at 22:29
  • @Kagaratsch something like `big_array = np.concatenate([df.values for df in imp])`, then just use `np.save('myarray.npz', big_array)` should be about as fast as possible – juanpa.arrivillaga Feb 03 '20 at 22:30
  • @juanpa.arrivillaga and AMC thanks for the suggestions! I'll give it a try. – Kagaratsch Feb 03 '20 at 22:31
  • @juanpa.arrivillaga is right, assuming that the data first appears in DataFrames, of course. – AMC Feb 03 '20 at 22:31
  • You could probably ditch `pandas` altogether, I don't know why people seem to use it as a csv-reader. – juanpa.arrivillaga Feb 03 '20 at 22:32
  • @Kagaratsch _How would that look like in practice?_ I'll try to explain the general process: If you have, say 150 arrays of shape (1140, 10), you would combine them into a single array of shape (150, 1140, 10). It's exactly the same sort of process/transformation as grouping a bunch of 1-dimensional arrays to form the rows of a 2-dimensional array, which is a bit easier to visualize. – AMC Feb 03 '20 at 22:35
  • @AMC I believe `imp` ends up being of that type precisely. – Kagaratsch Feb 03 '20 at 22:36
  • @juanpa.arrivillaga _You could probably ditch pandas altogether, I don't know why people seem to use it as a csv-reader._ Or as a NumPy substitute, for that matter. – AMC Feb 03 '20 at 22:37
  • @Kagaratsch _I believe imp ends up being of that type precisely._ When? – AMC Feb 03 '20 at 22:37
  • @AMC after all the `append` operations are done. Since each append writes a single 1140 x 10 dimensional entry, after n such appends we have an n x 1140 x 10 dimensional list, no? – Kagaratsch Feb 03 '20 at 22:38
  • @Kagaratsch no, `list` objects *don't have dimensions*. They do not have `dtypes`. `list` objects are heterogeneous sequences of python objects. multidimensional arrays are something else altogether. – juanpa.arrivillaga Feb 03 '20 at 22:40
  • @Kagaratsch Where does the data come from in the first place? Is it created by your program? – AMC Feb 03 '20 at 22:40
  • @juanpa.arrivillaga oh, I see. AMC, Data is measured by a physical apparatus and stored in csv format. – Kagaratsch Feb 03 '20 at 22:43
  • @Kagaratsch In that case, you might be able to use [`numpy.loadtxt()`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html) to read each CSV into an array, and then concatenate them. – AMC Feb 03 '20 at 22:49
  • @AMC thanks for the suggestion, I'll give that a try, too! – Kagaratsch Feb 03 '20 at 22:50
  • @AMC wow, after converting `imp` to a n x 1140 x 10 `numpy.asarray` and using `numpy.savez` as you suggested, the data loads in just `1.5` seconds via `imp=(np.load(mypath))['arr_0']`! If you post this as an answer I'll upvote and accept for sure! – Kagaratsch Feb 03 '20 at 23:17
  • @Kagaratsch Be careful, `numpy.savez()` is for saving multiple arrays into a single file, you need `numpy.save()`. _numpy.asarray_ That’s a function. I think you meant ndarray, the type? What does the code you used look like? – AMC Feb 03 '20 at 23:35
  • @AMC see edit of the question. – Kagaratsch Feb 03 '20 at 23:47
  • What do the contents of `imp` look like? – AMC Feb 03 '20 at 23:48
  • @AMC `imp.shape` is `(24347, 1140, 10)`. – Kagaratsch Feb 03 '20 at 23:52
  • What’s the `dtype`? Can you share the first few elements? – AMC Feb 03 '20 at 23:53
  • @AMC `imp.dtype` is `float32`. – Kagaratsch Feb 04 '20 at 00:00
  • @Kagaratsch Great, sounds like it's working! You should try reading the CSV files with numpy, though, it will be much simpler. Don't forget to use `numpy.save()`, not `numpy.savez()`, too! – AMC Feb 04 '20 at 00:12
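
Putting together the `numpy.loadtxt()` and single-array suggestions from the comments above, a pandas-free sketch of the import might look like this (paths and variable names are illustrative):

import os
import numpy as np

folder = '.\\folder'
files = sorted(os.listdir(folder))

# Read each CSV straight into a (1140, 10) float32 array and stack them
# into one (n, 1140, 10) array, which can then be saved with np.save().
imp = np.stack([np.loadtxt(os.path.join(folder, f), delimiter=',', dtype=np.float32)
                for f in files])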
