0

I am trying to do a conditional assignment to the rows of a specific column: target. I did some research, and it seemed that the answer was given here: "How to do row processing and item assignment in dask".

Here is a reproducible example. Mock dataset:

import pandas as pd
import dask.dataframe as dd

x = [3, 0, 3, 4, 0, 0, 0, 2, 0, 0, 0, 6, 9]
y = [200, 300, 400, 215, 219, 360, 280, 396, 145, 276, 190, 554, 355]
mock = pd.DataFrame(dict(target = x, speed = y))

This is what mock looks like:

In [4]: mock.head(7)
Out [4]:
   speed  target
0    200       3
1    300       0
2    400       3
3    215       4
4    219       0
5    360       0
6    280       0

Having this Pandas DataFrame, I convert it into a Dask DataFrame:

mock_dask = dd.from_pandas(mock, npartitions = 2)

I apply my conditional rule: all values in target above 0 must become 1, and all others 0 (binarize target). Following the thread mentioned above, it should be:

result = mock_dask.target.where(mock_dask.target > 0, 1)

Looking at the result, it does not work as expected:

In [7]: result.head(7)
Out [7]:
0    3
1    1
2    3
3    4
4    1
5    1
6    1
Name: target, dtype: object 

As we can see, the target column in result is not what I expected. My code seems to convert all the original 0 values to 1, instead of converting the values greater than 0 to 1 (the conditional rule).

Dask newbie here. Thanks in advance for your help.

NuValue

2 Answers

1

OK, the Dask DataFrame API documentation is pretty clear. Thanks to @MRocklin's feedback I have realized my mistake. In the documentation, the where function (the last one in the list) is described as follows:

DataFrame.where(cond[, other])
    Return an object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other.
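In other words, where keeps the entries for which cond is True and replaces the rest with other. A minimal sketch in plain pandas (which the Dask method mirrors), using the first five target values from the question, shows how the direction of the condition changes the result:

```python
import pandas as pd

# First five values of the target column from the question.
target = pd.Series([3, 0, 3, 4, 0], name="target")

# where() KEEPS entries where the condition is True and REPLACES the others.
kept_positive = target.where(target > 0, 1)   # positives kept, zeros become 1
kept_zero = target.where(target <= 0, 1)      # zeros kept, positives become 1

print(kept_positive.tolist())  # [3, 1, 3, 4, 1]
print(kept_zero.tolist())      # [1, 0, 1, 1, 0]
```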

Thus, the correct code line would be:

result = mock_dask.target.where(mock_dask.target <= 0, 1) 

This will output:

In [7]: result.head(7)
Out [7]:
0    1
1    0
2    1
3    1
4    0
5    0
6    0
Name: target, dtype: int64

Which is the expected output.
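As a side note, two other spellings of the same rule may read more directly. The sketch below uses plain pandas; I am assuming the same calls (comparison, astype, mask) behave the same way on a Dask series such as mock_dask.target. The -2 entry is hypothetical (the question's data has no negative values) and is only there to show where the two approaches differ:

```python
import pandas as pd

# Hypothetical sample: includes a negative value that is NOT in the question's data.
target = pd.Series([3, 0, -2, 4, 0], name="target")

# Option 1: compare, then cast the booleans to integers.
# Everything above 0 becomes 1; everything else (including negatives) becomes 0.
binary_cast = (target > 0).astype(int)

# Option 2: mask() is the inverse of where(): it REPLACES entries where the
# condition is True and keeps the others unchanged (so -2 stays -2).
binary_mask = target.mask(target > 0, 1)

print(binary_cast.tolist())  # [1, 0, 0, 1, 0]
print(binary_mask.tolist())  # [1, 0, -2, 1, 0]
```

If the column can contain negative values, only the comparison-and-cast version gives a strictly 0/1 result.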

NuValue
0

pandas and Dask seem to give the same result to me:

In [1]: import pandas as pd

In [2]: x = [1, 0, 1, 1, 0, 0, 0, 2, 0, 0, 0, 6, 9]
   ...: y = [200, 300, 400, 215, 219, 360, 280, 396, 145, 276, 190, 554, 355]
   ...: mock = pd.DataFrame(dict(target = x, speed = y))
   ...: 

In [3]: import dask.dataframe as dd

In [4]: mock_dask = dd.from_pandas(mock, npartitions = 2)

In [5]: mock.target.where(mock.target > 0, 1).head(5)
Out[5]: 
0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

In [6]: mock_dask.target.where(mock_dask.target > 0, 1).head(5)
Out[6]: 
0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64
MRocklin
  • Hi @MRocklin, thanks for your response. I have edited my question to make it clearer: it seems that my code line is converting all 0's to 1's, instead of converting the values that are 1 or greater to 1, which is the desired output. – NuValue May 09 '18 at 11:43