
I have a list of booleans:

unique_df1 = [True, True, False, ..., False, True]

I have a pyspark dataframe, df1:

type(df1) = pyspark.sql.dataframe.DataFrame

The lengths are compatible:

len(unique_df1) == df1.count()

How do I create a new dataframe, using unique_df1 to choose which rows of df1 it will contain?

Here is how I would do this with a pandas DataFrame:

import pandas as pd

lst = ['Geeks', 'For', 'Geeks', 'is',
       'portal', 'for', 'Geeks']

df1 = pd.DataFrame(lst)

unique_df1 = [True, False] * 3 + [True]

new_df = df1[unique_df1]

I can't find similar syntax for a pyspark.sql.dataframe.DataFrame, and I have tried more code snippets than I can count. How do I do this in pyspark?


1 Answer


Unfortunately, boolean indexing as shown in pandas is not directly available in pyspark. Your best option is to turn the mask into a DataFrame of its own, join it onto the existing DataFrame via a shared index column, and then use df.filter:

from pyspark.sql import functions as F

mask = [True, False, ...]  # your boolean list, same length as df

# Turn the mask into a one-column DataFrame
# (spark is the SparkSession; in older versions use sqlContext instead)
maskdf = spark.createDataFrame([(m,) for m in mask], ['mask'])

# Give both DataFrames a matching index column. Caveat:
# monotonically_increasing_id() depends on partitioning, so the ids
# only line up if both DataFrames are partitioned the same way.
df = df.withColumn("idx", F.monotonically_increasing_id())
maskdf = maskdf.withColumn("idx", F.monotonically_increasing_id())

# Join on the index, keep rows where the mask is True, drop the helpers
filtered = (df.join(maskdf, "idx")
              .filter('mask')
              .drop("idx", "mask"))

