Convert a dask series to a list of values

Question

Is there a way to convert a series from a dask dataframe to a list, so that I can iterate over it?

So far I have:

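import dask.dataframe as dd

ddf = dd.read_csv(MY_FILE)
s = ddf.iloc[:, [0]]  # first column, still a lazy dask DataFrame
r = s.compute()       # materializes a pandas DataFrame
r.a_column.values     # NumPy array of the column's values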

Thanks!

Could you elaborate on this? Will this list fit into memory? – SultanOrazbayev Dec 21 '21 at 02:51
Sure. My problem is that it is taking a lot of time to compute a large dataset, so my question is more focused on a 'native' way in Dask to iterate over a series without calling compute. – Cris Hernandez Dec 21 '21 at 03:02
@CrisHernandez do you think the suggested solution is optimal? I am curious because you accepted the answer. – Coder Aug 19 '22 at 15:42

SEUNGFWANI · Answer 1 · 2021-12-21T03:02:45.760

0

How about using a list comprehension (an inline for statement)? It lets you build a new iterable object.

You can get the values of the DataFrame using its values attribute.

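import dask.dataframe as dd

ddf = dd.read_csv(MY_FILE)
s = ddf.iloc[:, [0]]
r = s.compute()  # pandas DataFrame with one column
print([i[0] for i in r.values])  # each row of r.values is a length-1 array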

edited Dec 21 '21 at 03:02

answered Dec 21 '21 at 02:55


I got that. But is there a way to take a Dask Series and transform it into a list of values without calling compute first? – Cris Hernandez Dec 21 '21 at 05:02
As far as I know, a dask dataframe is lazy: it has a schema but holds no data itself. So if you want to get data from a file (etc.), you should call `compute()`, which is the action (load and execute) function. – SEUNGFWANI Dec 21 '21 at 06:13
I understand. Thanks mate. I think I need to dig deeper into the documentation. – Cris Hernandez Dec 21 '21 at 14:02

score 0 · Accepted Answer · answered Dec 21 '21 at 03:20


In general, it's preferable to avoid iterating over rows whenever possible (and use vectorized operations instead), see here. However, if the operations performed on the elements of a row are independent of neighbouring rows, then the easiest thing to do in dask is .map_partitions:

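import pandas as pd

def myfunc(df):
    # apply row operations assuming df is a pandas df (one partition)
    results = []
    for index, row in df.iterrows():
        # do something
        something = 'some_value'
        results.append(something)
    # return one result per row; an empty partition yields an empty Series
    return pd.Series(results, index=df.index, dtype=object)

r = ddf.map_partitions(myfunc)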


SultanOrazbayev


I created a list and appended the values to it inside the for loop, intending to return that list. But how can I get the values out of that list, given that the result is a dask object? – Cris Hernandez Dec 21 '21 at 05:04
Hmmm, how are you going to use the list afterwards? (is it going to be a parallel operation also or not?) – SultanOrazbayev Dec 21 '21 at 05:07
