-1

I was converting some pandas Series and DataFrames to Koalas for scalability. In places where I was using np.where(), I tried passing a Koalas DataFrame the same way I had previously passed a pandas DataFrame, but I got a PandasNotImplementedError.

How can I overcome this error? I tried ks.where() but it didn’t work.

Here is a model of the code I am working on, using pandas.

import pandas as pd
import numpy as np
pdf = np.where(condition, action1, action2)

The code works if I convert the Koalas objects back to pandas using toPandas() or from_pandas(), but for performance and scalability reasons I can’t use pandas. If possible, please suggest an alternative approach in Koalas, or an alternative to NumPy that can do this and works well with Koalas.

James Z

2 Answers

1

As per the Koalas (1.8.2) documentation, the where function on databricks.koalas.DataFrame and databricks.koalas.Series accepts only two arguments: the condition, and the value to use where the condition is False. Wherever the condition is True, the original value is left unchanged. This is similar to how where behaves in pandas.

Hence, where calls can be chained like this:

kdf.where(condition, action2).where(~condition, action1)
# action1 --> Action when condition is True.
# action2 --> Action when condition is False.

# The output of this cannot be assigned back to a column though. To assign the output to some column, the where has to be applied on a Series.
kdf['some_column'].where(condition, action2).where(~condition, action1)

Also note that in Koalas, the result of where applied to a databricks.koalas.Series can be assigned back to a column, but the result of where applied to a databricks.koalas.DataFrame cannot, even though pandas allows this in your case.
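To check that the chained form reproduces np.where, here is a minimal sketch in plain pandas. It relies on the statement above that Koalas where behaves like the pandas one, so pandas stands in for Koalas here. It has not been run against Koalas, and the sample data, column names and the values of action1 and action2 are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column name and values are illustrative only.
pdf = pd.DataFrame({"some_column": [1, 5, 10, 15]})
condition = pdf["some_column"] > 5
action1, action2 = 100, -100  # value when True, value when False

# Reference result from NumPy.
expected = np.where(condition, action1, action2)

# Chained where: the first call fills the rows where the condition is False,
# the second call fills the rows where the condition is True.
chained = pdf["some_column"].where(condition, action2).where(~condition, action1)

# The Series result can be assigned back to a column.
pdf["result"] = chained
```

If action1 or action2 is itself a Series rather than a scalar, the same chaining should apply, although in Koalas combining Series from different DataFrames may additionally require the compute.ops_on_diff_frames option.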

0

I'm not too familiar with koalas, but I think something using DataFrame.where() would work.

e.g.

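import databricks.koalas as ks
from databricks.koalas.config import set_option, reset_option

# Allow operations that combine two different Koalas DataFrames.
set_option("compute.ops_on_diff_frames", True)
df1 = ks.DataFrame({'A': [0, 1, 2, 3, 4], 'B':[100, 200, 300, 400, 500]})
df2 = ks.DataFrame({'A': [0, -1, -2, -3, -4], 'B':[-100, -200, -300, -400, -500]})
# Keep df1 where the condition is True; take values from df2 elsewhere.
df1.where(df1 > 1, df2)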

There's also a corresponding koalas Series.where() if that's what you need.

Nick ODell
  • I tried it earlier. But I’m getting TypeError : where() takes from 2 to 3 positional arguments but 4 were given. – Favaz Musthafa Dec 29 '21 at 03:35
  • @FavazMusthafa Have you tried giving it fewer arguments? – Nick ODell Dec 29 '21 at 03:36
  • I’m trying to convert existing pandas logic to Koalas. The existing code is pdf=np.where(condition, arg1, arg2), but for the same condition Koalas only accepts kdf=kdf.where(condition, arg1) – Favaz Musthafa Dec 29 '21 at 03:40
  • @FavazMusthafa The idea is that arg1 is your existing dataframe, and it uses the condition to choose between the existing dataframe and arg2, the other dataframe. It has exactly as many arguments, it's just that one of the arguments goes before the `.where`. – Nick ODell Dec 29 '21 at 03:42
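The mapping described in the last comment, np.where(condition, arg1, arg2) becoming arg1.where(condition, arg2), can be sketched in plain pandas as below. This assumes the Koalas DataFrame.where follows the same semantics, and it has not been run against Koalas. The data is the same as in the answer's example.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": [0, 1, 2, 3, 4], "B": [100, 200, 300, 400, 500]})
df2 = pd.DataFrame({"A": [0, -1, -2, -3, -4], "B": [-100, -200, -300, -400, -500]})
condition = df1 > 1

# NumPy form: three arguments, returns a plain ndarray.
via_numpy = np.where(condition, df1, df2)

# where form: the "True" values are the object the method is called on,
# so only the condition and the "False" values are passed as arguments.
via_where = df1.where(condition, df2)
```

One difference worth noting: np.where returns a bare array, while where keeps the index and column labels of df1.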