I want to create a subsample of n
rows using Dask. I tried 2 approaches:
1. Using frac:
import dask.dataframe as dd
read_path = ["test_data\\small1.csv", "test_data\\small2.csv", "huge.csv"]
df = dd.read_csv(read_path)
df = df.sample(frac=0.0001)
df = df.compute()
It works fast enough: it selects about 10,000 rows from a 100-million-row dataset in 16 seconds. But it can't guarantee an exact number of rows, because frac only gives an approximate (rounded) count.
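To make the rounding concrete, here is a minimal sketch (the total_len value and the 1.1 oversampling factor are illustrative assumptions, not part of my real code): it prints the approximate sample size and shows a rough workaround, oversampling and then trimming the computed pandas result, which still doesn't strictly guarantee enough rows.
import dask.dataframe as dd

read_path = ["test_data\\small1.csv", "test_data\\small2.csv", "huge.csv"]
nrows = 10000
total_len = 100_000_000  # assumed: total row count across all files is known up front

df = dd.read_csv(read_path)

# frac is applied per partition, so the realised sample size is only approximate.
sampled = df.sample(frac=nrows / total_len).compute()
print(len(sampled))  # close to 10000, but rarely exactly 10000

# Rough workaround: oversample with an arbitrary safety margin, then trim in pandas.
# Even so, at least nrows rows are not strictly guaranteed to come back.
oversampled = df.sample(frac=1.1 * nrows / total_len).compute()
if len(oversampled) >= nrows:
    exact = oversampled.sample(n=nrows, random_state=0)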
2. Using a for loop:
import random
from time import perf_counter

nrows = 10000
res_df = []
length = csv_loader.get_length()  # list with the row count of each csv (helper not shown)
total_len = sum(length)

start = perf_counter()
# Randomly choose which global row indices to skip.
inds = random.sample(range(total_len), total_len - nrows - len(length))

# Translate the global indices into a sorted, per-file skiprows list.
min_bound = 0
relative_inds = []
for leng in length:
    relative_inds.append(
        sorted([i - min_bound for i in inds if min_bound <= i < min_bound + leng])
    )
    min_bound += leng

for ind, fil in enumerate(read_path):
    res_df.append(dd.read_csv(fil, skiprows=relative_inds[ind], sample=1000000))
Here I calculate the indices of the rows I need to skip and then load the CSVs using skiprows. This method is very slow and sometimes crashes when I need to read 0 rows from one of the small CSVs, but it does guarantee an exact number of rows.
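For the crash, one mitigation might be to skip files that would contribute zero rows. A minimal sketch that reuses the variables from the loop above (and assumes length[ind] is the row count of read_path[ind]):
res_df = []
for ind, fil in enumerate(read_path):
    # Rows this file would contribute after skipping; avoid the read entirely if zero.
    rows_to_keep = length[ind] - len(relative_inds[ind])
    if rows_to_keep <= 0:
        continue
    res_df.append(dd.read_csv(fil, skiprows=relative_inds[ind], sample=1000000))

sample_df = dd.concat(res_df) if res_df else None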
Is there a fast way to get an exact number of rows using Dask?