Assign conditional values to columns in Dask

Question

I am trying to do a conditional assignment to the rows of a specific column, target. I did some research, and the answer seemed to be given here: "How to do row processing and item assignment in dask".

To reproduce my problem, here is a mock data set:

import pandas as pd
import dask.dataframe as dd

x = [3, 0, 3, 4, 0, 0, 0, 2, 0, 0, 0, 6, 9]
y = [200, 300, 400, 215, 219, 360, 280, 396, 145, 276, 190, 554, 355]
mock = pd.DataFrame(dict(target = x, speed = y))

This is what mock looks like:

In [4]: mock.head(7)
Out [4]:
       speed  target
    0    200       3
    1    300       0
    2    400       3
    3    215       4
    4    219       0
    5    360       0
    6    280       0

Having this Pandas DataFrame, I convert it into a Dask DataFrame:

mock_dask = dd.from_pandas(mock, npartitions = 2)

I apply my conditional rule: all values in target above 0 must become 1, and all others 0 (binarize target). Following the thread mentioned above, it should be:

result = mock_dask.target.where(mock_dask.target > 0, 1)

I have a look at the result dataset and it is not working as expected:

In [7]: result.head(7)
Out [7]:
0    3
1    1
2    3
3    4
4    1
5    1
6    1
Name: target, dtype: object

As we can see, the target column in result does not hold the expected values. My code seems to convert all the original 0 values to 1, instead of converting the values greater than 0 to 1 (the conditional rule).

Dask newbie here. Thanks in advance for your help.

score 1 · Accepted Answer · answered May 09 '18 at 12:02

OK, the documentation of the Dask DataFrame API is pretty clear. Thanks to @MRocklin's feedback I have realized my mistake. In the documentation, the where function (the last one in the list) is described with the following syntax:

DataFrame.where(cond[, other])      Return an object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other.

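In other words, where keeps the entries for which the condition is True and replaces all the others. The condition therefore has to select the values to keep (the zeros), so that everything else is replaced by 1. Thus, the correct code line would be: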

result = mock_dask.target.where(mock_dask.target <= 0, 1)

This will output:

In [7]: result.head(7)
Out [7]:
0    1
1    0
2    1
3    1
4    0
5    0
6    0
Name: target, dtype: int64

Which is the expected output.

score 0 · Answer 2 · answered May 09 '18 at 11:02

They seem to be the same to me

In [1]: import pandas as pd

In [2]: x = [1, 0, 1, 1, 0, 0, 0, 2, 0, 0, 0, 6, 9]
   ...: y = [200, 300, 400, 215, 219, 360, 280, 396, 145, 276, 190, 554, 355]
   ...: mock = pd.DataFrame(dict(target = x, speed = y))
   ...: 

In [3]: import dask.dataframe as dd

In [4]: mock_dask = dd.from_pandas(mock, npartitions = 2)

In [5]: mock.target.where(mock.target > 0, 1).head(5)
Out[5]: 
0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

In [6]: mock_dask.target.where(mock_dask.target > 0, 1).head(5)
Out[6]: 
0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

Hi @MRocklin, thanks for your response. I have edited my question to make it clearer: it seems that my code line is converting all 0s to 1s, instead of converting the values that are 1 or greater to 1, which is the desired output. — NuValue, May 09 '18 at 11:43
