
I am trying to get the number of unique values in a specific column of a ~10 GB CSV file and am looking for the fastest way to do that. I expected command-line tools like csvstat to run faster than pandas, but:

import datetime
import pandas as pd

def get_uniq_col_count(colname):
    df = pd.read_csv('faults_all_main_dp_1_joined__9-4-15.csv', engine='c', usecols=[colname], nrows=400000)
    df.columns = ['col']
    return len(set(df.col.values)), list(set(df.col.values))

t1 = datetime.datetime.now()
count, uniq = get_uniq_col_count('model')
print(datetime.datetime.now() - t1)

# 0:00:04.386585

vs.

$ time csvcut -c model faults_all_main_dp_1_joined__9-4-15.csv | head -n 400000 | csvstat --unique
3
real    1m3.343s
user    1m3.212s
sys     0m0.237s

(I am piping through head, because I let csvstat run on the whole dataset, went out for lunch, came back, and it was still running. It took pandas 50 seconds to finish.)

I wonder whether I am doing something wrong and, more generally, whether there is a way to speed up the process. (There are about 5 million rows to read through for each column.)

Anarcho-Chossid

1 Answer


This is not surprising. The csvkit tools are written entirely in Python and do not use any optimization tricks. In particular, csvstat reads the entire table into memory and stores multiple copies of it as regular Python lists. This has huge overhead, both in memory and in garbage-collector time.

On the other hand, pandas uses numpy, which means the whole column is stored in only a few Python objects with almost no memory overhead. You may be able to make your program slightly faster by using the pandas-specific unique method (see Find unique values in a Pandas dataframe, irrespective of row or column location).
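
A minimal sketch of what that could look like, reusing the file and column names from your question; nunique()/unique() operate on the numpy array directly instead of building a Python set of per-row objects:

import pandas as pd

def get_uniq_col_count(colname):
    # Read just the one column, as in the original code.
    df = pd.read_csv('faults_all_main_dp_1_joined__9-4-15.csv',
                     engine='c', usecols=[colname], nrows=400000)
    col = df[colname]
    # Count and list unique values without round-tripping through a Python set.
    return col.nunique(), col.unique().tolist()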

If you are going to be doing this a lot, convert the data to a more efficient format. This page: http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization/ shows that once you convert your data to HDF5, it will be much faster to load.
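
A rough sketch of the conversion, assuming PyTables is installed (pip install tables); the HDF5 path and store key ('faults') are made up here, and the one-time conversion still has to read the full CSV:

import pandas as pd

csv_path = 'faults_all_main_dp_1_joined__9-4-15.csv'
h5_path = 'faults_all_main_dp_1_joined__9-4-15.h5'

# One-time conversion; format='table' lets you read back a subset of columns later.
df = pd.read_csv(csv_path, engine='c')
df.to_hdf(h5_path, key='faults', format='table', data_columns=['model'])

# Subsequent runs: load only the one column and count unique values.
col = pd.read_hdf(h5_path, key='faults', columns=['model'])['model']
print(col.nunique())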

theamk