
I want to create a subsample of n rows using Dask. I tried 2 approaches:

1. Using frac:

import dask.dataframe as dd    
read_path = ["test_data\\small1.csv", "test_data\\small2.csv", "huge.csv"]
df = dd.read_csv(read_path)
df = df.sample(frac=0.0001)
df = df.compute()

It works fast enough: selecting 10,000 rows from a 100-million-row dataset takes about 16 seconds. But it can't guarantee an exact number of rows, because the count produced by frac gets rounded.
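
(One mitigation I considered, sketched here rather than tested: oversample slightly with frac, then trim to the exact count with pandas after compute(). The 1.1 oversampling factor and the known total row count are assumptions.)

import dask.dataframe as dd

nrows = 10000
total_len = 100_000_000            # assumed known total row count across the CSVs

# Oversample by ~10% so rounding rarely leaves the sample short,
# then trim to exactly nrows with pandas once the result is in memory.
frac = (nrows * 1.1) / total_len
sampled = dd.read_csv(read_path).sample(frac=frac).compute()
exact = sampled.sample(n=nrows, random_state=0)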

2. Using a for loop:

import random
from time import perf_counter

import dask.dataframe as dd

nrows = 10000
res_df = []
length = csv_loader.get_length()   # per-file row counts (my own helper)
total_len = sum(length)
start = perf_counter()

# Pick the global indices of the rows to drop, leaving nrows behind
# (len(length) is subtracted, presumably to account for one header line per file)
inds = random.sample(range(total_len), total_len - nrows - len(length))

# Convert the global indices into per-file skiprows lists
min_bound = 0
relative_inds = []
for leng in length:
    relative_inds.append(
        sorted([i - min_bound for i in inds if min_bound <= i < min_bound + leng])
    )
    min_bound += leng

for ind, fil in enumerate(read_path):
    res_df.append(dd.read_csv(fil, skiprows=relative_inds[ind], sample=1000000))

Here I calculate the indices of the rows I need to skip and then load the CSVs using skiprows. This method is very slow and sometimes crashes when I need to read 0 rows from one of the small CSVs, but it does guarantee an exact number of rows.
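
(The zero-rows crash seems to come from files where every row ends up in relative_inds. A small guard like the sketch below, which simply skips such a file and concatenates the rest, avoids it:)

parts = []
for ind, fil in enumerate(read_path):
    # If all rows of this file are to be skipped, don't read it at all:
    # the empty read is what crashes here.
    if len(relative_inds[ind]) >= length[ind]:
        continue
    parts.append(dd.read_csv(fil, skiprows=relative_inds[ind], sample=1000000))
res_df = dd.concat(parts)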

Is there a fast way to get an exact number of rows using Dask?

Mikhail_Sam

1 Answer


I found a solution. The idea is to adjust frac until int(total_len * frac) lands exactly on nrows, and then sample with that fraction:

total_len = get_total_length()  # total number of rows across all the CSVs
frac = nrows / total_len

# Shrink the denominator until the rounded sample size is exactly nrows
counter = 1
while int(total_len * frac) != nrows:
    frac = nrows / (total_len - counter)
    counter += 1

res_df = dd.read_csv(read_path)
res_df = res_df.sample(frac=frac)
res_df = res_df.compute()

For how to efficiently count the number of rows in a CSV, see the linked question.
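
(For completeness, one cheap way to get the row counts without parsing the CSVs; the count_rows helper here is hypothetical, and it assumes each file ends with a newline and has one header line:)

def count_rows(path, has_header=True):
    # Count newline characters in binary chunks; much cheaper than parsing the CSV.
    rows = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            rows += chunk.count(b"\n")
    return rows - 1 if has_header else rows

total_len = sum(count_rows(p) for p in read_path)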

Mikhail_Sam