
I'm loading a CSV file (if you want the specific file, it's the training csv from http://www.kaggle.com/c/loan-default-prediction). Loading the csv in numpy takes dramatically more time than in pandas.

```python
timeit("genfromtxt('train_v2.csv', delimiter=',')", "from numpy import genfromtxt", number=1)
102.46608114242554

timeit("pandas.io.parsers.read_csv('train_v2.csv')", "import pandas", number=1)
13.833590984344482
```

I'll also mention that numpy's memory usage fluctuates much more wildly, peaks higher, and is significantly higher once loaded (2.49 GB for numpy vs ~600 MB for pandas). All datatypes in pandas are 8 bytes, so differing dtypes do not explain the difference. I never came close to maxing out my memory, so the time difference cannot be ascribed to paging.

Any reason for this difference? Is genfromtxt just way less efficient? (And leaks a bunch of memory?)
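A minimal, self-contained way to reproduce the comparison, sketched against synthetic in-memory data rather than the actual `train_v2.csv` (the column layout here is made up for illustration; only numpy and pandas are assumed):

```python
import io
import timeit
import numpy as np
import pandas as pd

# Build a small synthetic CSV in memory as a stand-in for train_v2.csv.
rows = 10000
csv_text = "a,b,c\n" + "\n".join(f"{i},{i * 0.5},{i * 2}" for i in range(rows))

def load_numpy():
    # genfromtxt accepts any file-like object, so StringIO works here.
    return np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

def load_pandas():
    return pd.read_csv(io.StringIO(csv_text))

t_np = timeit.timeit(load_numpy, number=3)
t_pd = timeit.timeit(load_pandas, number=3)
print(f"genfromtxt: {t_np:.3f}s  read_csv: {t_pd:.3f}s")
```

On data this small the absolute numbers are tiny, but the relative gap between the two parsers is already visible.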

EDIT:

numpy version 1.8.0

pandas version 0.13.0-111-ge29c8e8

Kurt Spindler
    Basically, yes. `genfromtxt` is just way less efficient. It's not that it leaks memory, just that it essentially reads everything in as Python lists and then converts to a numpy array. `pandas.read_csv` is just that much more efficient. Not to plug my own answer, but see here: http://stackoverflow.com/questions/8956832/python-out-of-memory-on-large-csv-file-numpy/8964779#8964779 for a comparison of the various numpy text loading approaches. (That answer deliberately leaves `pandas.read_csv` out, but it's similar in performance to the last example.) – Joe Kington Jan 31 '14 at 18:08
  • 2
    If that's really the case, maybe I'll see what I can do about submitting a patch to numpy. As it stands, loading the DataFrame followed by `df.as_matrix()` was ~15s total, compared to 102 for genfromtxt – Kurt Spindler Jan 31 '14 at 18:10
  • And thank you for the pointer to your other question, that is informative. – Kurt Spindler Jan 31 '14 at 18:13
  • 5
    Here's the original article on ``read_csv`` from Wes who wrote it: http://wesmckinney.com/blog/?p=543 – Jeff Jan 31 '14 at 18:24
  • 1
    @JoeKington Perhaps you can post your comments as an answer... – Saullo G. P. Castro Feb 01 '14 at 16:35
  • Here is the updated link of the post by Wes as suggested by @Jeff : https://wesmckinney.com/blog/a-new-high-performance-memory-efficient-file-parser-engine-for-pandas/ – Ken T Jun 17 '19 at 12:57
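The workaround mentioned in the comments (parse with pandas, then take the underlying numpy array) can be sketched like this; the tiny inline CSV is illustrative only, and `.values` is the accessor available in the pandas versions from this era (newer releases also offer `.to_numpy()`):

```python
import io
import pandas as pd

csv_text = "x,y\n1,2.5\n3,4.5\n"

# Fast path: let pandas do the parsing, then pull out the ndarray.
# Mixed int/float columns are upcast to a common float64 dtype.
arr = pd.read_csv(io.StringIO(csv_text)).values
print(arr.dtype, arr.shape)
```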

1 Answer


`genfromtxt` from the NumPy module runs two main loops. The first converts all the lines in the file to strings, and the second converts each string to its data type. In exchange, `genfromtxt` gives you more flexibility than other loaders like `loadtxt` and `read_csv`.
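A rough sketch of the two-stage pattern described above (read everything into Python lists of strings, then convert each value), which is essentially what makes the pure-Python path slow compared with pandas' C parser:

```python
import io
import numpy as np

csv_text = "1,2,3\n4,5,6\n"

# Stage 1: read every line and split it into Python strings.
rows = [line.strip().split(",") for line in io.StringIO(csv_text) if line.strip()]

# Stage 2: convert each string to a float, then build the array.
data = np.array([[float(v) for v in row] for row in rows])
print(data)
```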