I'm trying to batch the CSV rows read with Dask:
import json

import dask.dataframe as dd

batch_size = 1000  # flush every 1000 rows
batch = []
count = 0


def batch_row_csv(row):
    global batch
    global count
    batch.append(row.to_dict())  # a pandas Series is not JSON serializable, so convert it
    if len(batch) < batch_size:
        return
    with open(f"batch_{count}.json", "w") as f:  # save the current batch (filename is just an example)
        json.dump(batch, f)
    count = count + 1
    batch = []
    return


df = dd.read_csv(path, header=0)
df["output"] = df.apply(batch_row_csv, axis=1, meta=object)
result = df.compute()  # the row-wise apply actually runs here
Is there a problem with using global variables together with multiprocessing? The Dask best practices advise against using global variables... What would be the alternative? Can this task be done with Dask?
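For reference, this is roughly the globals-free version I have in mind: instead of counting rows myself, each partition would act as one batch and map_partitions would write it out, with the partition number standing in for my count variable. This is only a rough sketch under my own assumptions: dump_partition is a name I made up, the batch_{n}.json filename pattern and the blocksize value are placeholders, and the batches would be partition-sized rather than exactly 1000 rows.

import json

import dask.dataframe as dd


def dump_partition(pdf, partition_info=None):
    # pdf is an ordinary pandas DataFrame holding one partition's rows;
    # partition_info["number"] identifies the partition, so no globals are needed
    part = partition_info["number"] if partition_info else 0
    with open(f"batch_{part}.json", "w") as f:  # placeholder output path
        json.dump(pdf.to_dict(orient="records"), f)
    return pdf


df = dd.read_csv(path, header=0, blocksize="25MB")  # blocksize roughly controls rows per batch
df.map_partitions(dump_partition, meta=df._meta).compute()

Would something along these lines be the recommended way to do it, or is there a better pattern?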