
I'm trying to batch the CSV rows read with Dask:

Can this task be done with dask?

import json
import dask.dataframe as dd

batch_size = 1000  # flush every 1000 rows
batch = []
count = 0

def batch_row_csv(row):
    global batch
    global count
    batch.append(row.to_dict())  # row arrives as a pandas Series
    if len(batch) < batch_size:
        return
    with open(f"batch_{count}.json", "w") as f:  # save batch (filename is just an example)
        json.dump(batch, f)
    count = count + 1
    batch = []
    return

df = dd.read_csv(path, header=0)
# .compute() triggers the row-wise apply; batch_row_csv returns None, so the result is discarded
df.apply(lambda x: batch_row_csv(x), axis=1, meta=object).compute()

Is there a problem with global variables and multiprocessing? The Dask best practices advise against using global variables ... What would be the alternative?
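
For reference, this is the kind of globals-free alternative I have in mind (an untested sketch; the blocksize value and the output filenames are just placeholders): treat each partition as one batch and write it out with dask.delayed.

import dask
import dask.dataframe as dd

# blocksize controls how much data goes into each partition (by bytes, not rows);
# the "1MB" value and the batch_*.json filenames are only illustrative
df = dd.read_csv(path, header=0, blocksize="1MB")

@dask.delayed
def save_batch(pdf, i):
    # pdf is a plain pandas DataFrame holding one partition
    pdf.to_json(f"batch_{i}.json", orient="records")

# one independent task per partition, no shared list or counter
tasks = [save_batch(part, i) for i, part in enumerate(df.to_delayed())]
dask.compute(*tasks)

If I understand the best-practices page correctly, each partition becomes its own task here, so nothing has to be shared between workers. Is that the recommended direction?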


Francisco
  • Does this answer your question? [Using global variables in a function](https://stackoverflow.com/questions/423379/using-global-variables-in-a-function) –  Nov 12 '19 at 00:14
  • it is not recommended to use global variables from dask https://docs.dask.org/en/latest/delayed-best-practices.html – Francisco Nov 12 '19 at 00:26

0 Answers