
I am using read_csv() to read a long list of CSV files and return two dataframes. I have managed to speed this up by using dask. Unfortunately, I have not been able to return multiple values from a function when using dask.

The minimum working example below replicates my issue:

import pandas as pd
import dask.dataframe as dd
from dask import delayed

@delayed(nout=2)
def function(a):
  d = 0
  c = a + a
  if a > 4: # random condition to make c and d of different lengths
    d = a * a
  return pd.DataFrame([c])#, pd.DataFrame([d])

list = [1,2,3,4,5]

dfs = [delayed(function)(int) for int in list]
ddf = dd.from_delayed(dfs)
ddf.compute()

Any ideas on how to resolve this issue are appreciated. Thanks.

SultanOrazbayev
Tanjil
  • Right now, the snippet decorates `function` with `delayed` twice. That's going to cause problems, so best to remove one of them. I'd suggest leaving "@", but either is fine. – SultanOrazbayev Apr 02 '22 at 05:31
  • I understand, but would that line not be needed since I want to return two data frames from my function? – Tanjil Apr 02 '22 at 06:04
  • Nope, the nested decoration is unrelated to the number of delayed outputs. Check the examples on [this page](https://docs.dask.org/en/stable/delayed-best-practices.html) in section "Avoid calling delayed within delayed functions". – SultanOrazbayev Apr 02 '22 at 06:08
  • Hi Sultan, would it be possible to update the solution provided? I have deleted the second "delayed", but am still unable to return two dataframes from function(). – Tanjil Apr 02 '22 at 18:16
  • I updated the answer, if it doesn't help, then a bit more clarity on the errors will be needed. – SultanOrazbayev Apr 03 '22 at 09:45
  • I am sorry. My replies may have made matters very confusing. Would a teams chat be possible to iron out these issues? – Tanjil Apr 03 '22 at 21:45

1 Answer


The delayed decorator has an nout parameter, so something like this might work:

from dask import delayed

@delayed(nout=2)
def function(a,b):
  c = a + b
  d = a * b
  return c, d

delayed_c, delayed_d = function(2, 3)

From the question it's not clear at which step the dataframes come in, but the key part of the question (returning more than one value from dask delayed) is answered by using nout; see this answer for full details.
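As a minimal sketch of how the two delayed outputs could then be evaluated, the snippet below passes both to dask.compute in one call. The name add_and_multiply and the inputs are made up for illustration and are not from the question.

```python
from dask import compute, delayed


# nout=2 tells dask the function returns two values, so the Delayed
# result can be unpacked into two separate Delayed objects.
@delayed(nout=2)
def add_and_multiply(a, b):  # hypothetical name, for illustration only
    return a + b, a * b


delayed_sum, delayed_product = add_and_multiply(2, 3)

# Nothing has run yet; compute() evaluates both outputs together.
total, product = compute(delayed_sum, delayed_product)
print(total, product)
```

Computing both outputs in a single compute call lets dask run the underlying function once rather than once per output.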

Update:

The delayed function in the updated question returns a tuple of dataframes. This means that dd.from_delayed should be called either on each element of the tuple separately, or on a flat list in which the tuples have been unpacked:

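import dask.dataframe as dd

# function(int) is a Delayed with nout=2, so iterating over it yields its two delayed dataframes
dfs = [delayed_value for int in list for delayed_value in function(int)]
ddf = dd.from_delayed(dfs)
ddf.compute()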
SultanOrazbayev
  • Hi Sultan, Thank you for your answer. I have edited the example to indicate where the array of values are passed to the function and where the output dataframes are returned. – Tanjil Apr 01 '22 at 19:30