
I need to exclude rows whose name doesn't have a True value in the status column.

I thought a filter(isin() == False) construct would solve my problem, but it doesn't.

df = sqlContext.createDataFrame([("A", "True"), ("A", "False"), ("B", "False"), ("C", "True")], ("name", "status"))
df.registerTempTable("df")
df_t = df[df.status == "True"]

from pyspark.sql import functions as sf

df_f = df.filter(df.status.isin(df_t.name) == False)

I expect this row:

B | False

Any help is greatly appreciated!

  • `isin` doesn't work like that. You need to use a `join`. – pault Jul 18 '19 at 13:41
  • Possible duplicate of [Spark replacement for EXISTS and IN](https://stackoverflow.com/questions/34861516/spark-replacement-for-exists-and-in) – pault Jul 18 '19 at 13:42

1 Answer


First, I think in your last statement, you meant to use df.name instead of df.status.

df_f = df.filter(df.status.isin(df_t.name) == False)

Second, even if you use df.name, it still won't work, because the statement mixes Column objects from two different DataFrames (df_t and df), which PySpark doesn't support. You can achieve the same effect another way. If I understand correctly, you first want to find the names whose 'status' is "True" ('A' and 'C'), then exclude all rows with those names. The key is to extend the exclusion to the second 'A' row as well, which can be done with a Window. See below:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = sqlContext.createDataFrame([("A", "True"), ("A", "False"), ("B", "False"), ("C", "True")], ("name", "status"))
df.registerTempTable("df")
# create an auxiliary column satisfying the condition
df = df.withColumn("flag", F.when(df['status']=="True", 1).otherwise(0))
df.show()
# extend the selection to other rows with the same 'name'
df = df.withColumn('flag', F.max(df['flag']).over(Window.partitionBy('name')))
df.show()
# filtering is now easy
df_f = df.filter(df.flag==0)
df_f.show()
niuer