Dict to dask dataframe

Question

I have a list of dictionaries of delayed objects. The computed value of each delayed object should become an entry in a dask.DataFrame.

dfs = []

for source_list in list_of_list:
    values1 = {}
    values2 = {}

    for source in source_list:
        intermediate = dask.delayed(myfunc)(source)

        source_name = string_manipulation(source)
        values1[source_name] = dask.delayed(myfunc1)(intermediate)
        values2[source_name] = dask.delayed(myfunc2)(intermediate)

    df1 = dd.from_delayed(values1)  # TypeError: Expected Delayed object, got str
    df2 = dd.from_delayed(values2)
    df = dd.concat(df1, df2)
    df = df.T  # transpose function for dd?
    dfs.append(df)

dfs = dd.concat(dfs)
dfs = dfs.compute()

Normally pandas.DataFrame converts the keys of a dictionary to columns. How can this be achieved with dask.DataFrame? Perhaps there are more efficient methods.

I appreciate your comment.

I think this has been answered in https://stackoverflow.com/questions/59377561/create-a-dask-dataframe-from-a-dictionary — quasiben, Jul 28 '20 at 15:45
I think the difference comes from the delayed object as the dictionary element. Is there a way to compute only after `dask.DataFrame` is created for more efficient computation? — Simon, Jul 28 '20 at 19:37

score 0 · Answer 1 · answered Aug 08 '20 at 00:57

dd.from_delayed expects a list of delayed objects, each of which returns a pandas dataframe. You are providing a dictionary of delayed objects, hence the error.

You need to create a list of delayed objects, each of which would produce a pandas dataframe when computed. All of those pandas dataframes should have identical columns and types.
