I'm trying to batch the CSV rows read with Dask:
import json

import dask.dataframe as dd

batch_size = 1000  # flush every 1000 rows
batch = []
count = 0


def batch_row_csv(row):
    global batch
    global count
    batch.append(row.to_dict())  # a pandas Series is not JSON serializable, so convert it
    if len(batch) < batch_size:
        return
    with open(f"batch_{count}.json", "w") as f:  # save the current batch (filename is just an example)
        json.dump(batch, f)
    count = count + 1
    batch = []
    return


df = dd.read_csv(path, header=0)
df["output"] = df.apply(batch_row_csv, axis=1, meta=object)
result = df.compute()  # the row-wise apply actually runs here
Is there a problem with using global variables together with multiprocessing? The Dask best practices advise against using global variables... What would be the alternative? Can this task be done with Dask?
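For reference, this is roughly the globals-free version I have in mind: instead of counting rows myself, each partition would act as one batch and map_partitions would write it out, with the partition number standing in for my count variable. This is only a rough sketch under my own assumptions: dump_partition is a name I made up, the batch_{n}.json filename pattern and the blocksize value are placeholders, and the batches would be partition-sized rather than exactly 1000 rows.

import json

import dask.dataframe as dd


def dump_partition(pdf, partition_info=None):
    # pdf is an ordinary pandas DataFrame holding one partition's rows;
    # partition_info["number"] identifies the partition, so no globals are needed
    part = partition_info["number"] if partition_info else 0
    with open(f"batch_{part}.json", "w") as f:  # placeholder output path
        json.dump(pdf.to_dict(orient="records"), f)
    return pdf


df = dd.read_csv(path, header=0, blocksize="25MB")  # blocksize roughly controls rows per batch
df.map_partitions(dump_partition, meta=df._meta).compute()

Would something along these lines be the recommended way to do it, or is there a better pattern?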