I am trying to get the number of unique values in a specific column of a CSV file (~10 GB) and am looking for the fastest way to do that. I expected command-line tools like csvstat to run faster than pandas, but:
import datetime
import pandas as pd

def get_uniq_col_count(colname):
    # Read only this one column, first 400,000 rows, with the C engine.
    df = pd.read_csv('faults_all_main_dp_1_joined__9-4-15.csv', engine='c',
                     usecols=[colname], nrows=400000)
    uniq = set(df[colname].values)
    return len(uniq), list(uniq)

t1 = datetime.datetime.now()
count, uniq = get_uniq_col_count('model')
print(datetime.datetime.now() - t1)
# 0:00:04.386585
vs.
$ time csvcut -c model faults_all_main_dp_1_joined__9-4-15.csv | head -n 400000 | csvstat --unique
3
real 1m3.343s
user 1m3.212s
sys 0m0.237s
(I am capping it with head because when I let csvstat run on the whole dataset, I went out for lunch, came back, and it was still running. Pandas took 50 sec to finish the full file.)
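For the full file I simply drop nrows. To keep memory flat on the 10 GB file, I assume a chunked read would also work; a minimal sketch (not what I timed, and the chunksize is a guess):

import pandas as pd

# Chunked full-file pass: accumulate distinct values one chunk at a
# time instead of loading the whole column at once.
uniq = set()
for chunk in pd.read_csv('faults_all_main_dp_1_joined__9-4-15.csv',
                         engine='c', usecols=['model'], chunksize=1000000):
    uniq.update(chunk['model'].values)
print(len(uniq))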
I wonder if I am doing something wrong, and more generally, whether there is a way to speed up the process. (There are about 5 million rows to read for each column.)
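As a baseline, I could also stream the file with the stdlib csv module and skip pandas entirely; a minimal sketch, assuming the header row names the column 'model' as above:

import csv

# Plain-Python streaming count: one pass over the file, one set.
uniq = set()
with open('faults_all_main_dp_1_joined__9-4-15.csv', newline='') as f:
    for row in csv.DictReader(f):
        uniq.add(row['model'])
print(len(uniq))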