7

I tried to follow this answer but my question is slightly different.

I have two pyspark data frames df2 and bears2. Both have an integer variable, and I want to create a boolean like this pseudocode:

df3 = df2.withColumn("game", (df2.week_id.isin(bears2.week_if), 1,0))

Basically, if the value of df2 exists in the corresponding column of bears2, I want a 1 else a 0

I tried the expr() from the other question, but wasn't able to get it to work. It looked like this:

new_column_1 = F.expr(
    """IF(df2.week_id IN(bears2.week_if), 1, 0))"""
    )
pault
  • 41,343
  • 15
  • 107
  • 149
mlewis
  • 101
  • 1
  • 2
  • 6
  • you've forgot `withColumn(...., when(condition, value).otherwise(another_value))` – vvg May 30 '18 at 13:37
  • I tried this based on another answer (which was not deleted). Do I need to convert bears2.week_if into a list/vector for this to work? Because I get an error. – mlewis May 30 '18 at 13:44
  • @mlewis `IN` will not work for dataframes- you'll have to use join. I'm updating the answer. – pault May 30 '18 at 13:45
  • `isin` expects list as a parameter. if you want to check with another dataframe - you can do join on it. – vvg May 30 '18 at 13:50

1 Answers1

6

You can't currently use IN like that with pyspark-sql1. Instead, you will have to join the DataFrames.

Try something like:

from pyspark.sql.functions import col, when
df3 = df2.withColumn("id", col("week_id")).alias("df2")\
    .join(bears2.withColumn("id", col("week_if")).alias("bears2"), on="id", how="left")\
    .select("df2.*", when(col("bears2.id").isNotNull(), 1).otherwise(0))

For the join to work, the join key column has to exist in both DataFrames. For that reason, I first call withColumn("id", ...) to rename the columns to the same value for the join.

Next we do a LEFT join to keep all of the columns in df2. Finally we select all of the columns in df2 and use pyspark.sql.functions.when() to create the boolean column.

The first argument to when() is a condition. If it is True, the second argument is returned. If not, the value in otherwise() is used.

pault
  • 41,343
  • 15
  • 107
  • 149
  • The output of the join looks good, but the when statement returns `TypeError: condition should be a Column` Should I try one of the methods in the previous answer of creating the column? – mlewis May 30 '18 at 14:03
  • @mlewis try the update- I changed the sql condition to `isNotNull()` instead. – pault May 30 '18 at 14:05