1

The aim is to find the total number of rows in a large CSV file. I'm using Python Dask for this at the moment, but since the file is around 45 GB it takes quite some time. Unix `cat` with `wc -l` seems to perform better.

So the question is: are there any tweaks for Dask / pandas `read_csv` to make it count the total number of rows faster?
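For reference, a minimal sketch of the kind of Dask dataframe approach described above (the file path is just a placeholder):

import dask.dataframe as dd

# Lazily read the CSV and count the rows across all partitions
ddf = dd.read_csv("large_file.csv")
row_count = ddf.map_partitions(len).sum().compute()
print(row_count)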

ranger101
  • 1,184
  • 4
  • 12
  • 20
  • Not going to flag as duplicate, but have a look [at this post](https://stackoverflow.com/q/845058/6340496). Might be really helpful, it’s def helped me on the same subject! – S3DEV Sep 07 '20 at 22:16

2 Answers

2

Dask dataframe will spend 90% of its time parsing your text into various numerical types like int, float, etc. You don't need any of that, so it's best not to build a dataframe at all.

You could use `dask.bag` instead, which would be faster and simpler:

import dask.bag

dask.bag.read_text("...").count().compute()

But in truth `wc -l` is going to be about as fast as anything else. You should be entirely bound by your disk speed here, not by compute power. Dask helps you leverage multiple CPU cores, but those aren't the bottleneck in this case, so Dask isn't the right tool; `wc` is.
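To illustrate that point, here is a minimal sketch of a plain, single-threaded newline count in pure Python (the chunk size and file name are arbitrary placeholders); since it does no parsing, it should run at roughly disk speed, much like `wc -l`:

def count_rows(path, chunk_size=1 << 20):
    # Count newline bytes in fixed-size binary chunks, with no parsing at all
    count = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            count += chunk.count(b"\n")
    return count

print(count_rows("file.csv"))

Like `wc -l`, this counts newline characters, so a header row is included in the count.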

MRocklin
  • 55,641
  • 23
  • 163
  • 235
1

You can try `subprocess` in Python code:

import subprocess

fileName = "file.csv"
# check_output returns wc's stdout; subprocess.call would only return the exit status
output = subprocess.check_output(['wc', '-l', fileName])
row_count = int(output.split()[0])
Soumendra Mishra
  • 3,483
  • 1
  • 12
  • 38
  • Thanks for this. I have been doing this, but my question is more on any tweaks for pandas/dask to find it – ranger101 Sep 06 '20 at 05:30