I have a list of dictionaries of delayed objects. The computed value of each delayed object should become an entry in a Dask DataFrame.

dfs = []

for source_list in list_of_list:
    values1 = {}
    values2 = {}

    for source in source_list:
        intermediate = dask.delayed(myfunc)(source)

        source_name = string_manipulation(source)
        values1[source_name] = dask.delayed(myfunc1)(intermediate)
        values2[source_name] = dask.delayed(myfunc2)(intermediate)

    df1 = dd.from_delayed(values1)  # TypeError: Expected Delayed object, got str
    df2 = dd.from_delayed(values2)
    df = dd.concat([df1, df2])  # dd.concat takes a list of dataframes
    df = df.T  # transpose function for dd?
    dfs.append(df)

dfs = dd.concat(dfs)
dfs = dfs.compute()

Normally pandas.DataFrame turns the keys of a dictionary into columns. How can this be achieved with a Dask DataFrame? Perhaps there is a more efficient approach.

I would appreciate any comments.

Dennis Kozevnikoff
Simon
  • I think this has been answered in https://stackoverflow.com/questions/59377561/create-a-dask-dataframe-from-a-dictionary – quasiben Jul 28 '20 at 15:45
  • I think the difference comes from the delayed object as the dictionary element. Is there a way to compute only after `dask.DataFrame` is created for more efficient computation? – Simon Jul 28 '20 at 19:37

1 Answer

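`dd.from_delayed` expects a list of delayed objects, each of which returns a pandas DataFrame when computed. You are passing a dictionary of delayed objects instead. Iterating over a dictionary yields its keys, which here are strings, hence the error "Expected Delayed object, got str".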

You need to create a list of delayed objects, each of which would produce a pandas dataframe when computed. All of those pandas dataframes should have identical columns and types.

MRocklin