
I am trying to read the first row from a file and then filter it out of the dataframe.

I am using take(1) to read the first row. I then want to filter that row out of the dataframe (it could appear multiple times within the dataset).

from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext(appName = "solution01")
spark = SparkSession(sc)

df1 = spark.read.csv("/Users/abc/test.csv")
header = df1.take(1)
print(header)

final_df = df1.filter(lambda x: x != header)
final_df.show()

However, I get the following error: TypeError: condition should be string or Column.

I was trying to follow the answer from Nicky here: How to skip more then one lines of header in RDD in Spark.
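
That answer filters an RDD, where a lambda predicate is valid. A rough sketch of the same idea going through df1.rdd and back (my assumption, untested) would be:

# RDD route: lambdas are valid filter predicates on RDDs, unlike DataFrames
header_row = df1.take(1)[0]
final_df = df1.rdd.filter(lambda row: row != header_row).toDF(df1.schema)
final_df.show()

I would rather do this directly with the DataFrame API, though.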

The data looks like this (but it will have multiple columns that I need to do the same for):

customer_id
1
2
3
customer_id
4
customer_id
5

I want the result as:

1
2
3
4
5
1 Answer


take on a DataFrame returns a list of Row objects, so we need to extract the value with [0][0]. Then, in the filter clause, use the column name and keep only the rows that are not equal to the header:

from pyspark.sql.functions import col

# take(1) returns a list with one Row; [0][0] extracts the first column's value
header = df1.take(1)[0][0]

# filter out rows that are equal to the header
final_df = df1.filter(col("<col_name>") != header)
final_df.show()
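
Since your file has multiple columns that each repeat their header, you can also build one condition from the whole first row instead of hard-coding a column name. A sketch, assuming the CSV is read without a header so Spark assigns default names like _c0:

from pyspark.sql.functions import col

first_row = df1.take(1)[0]  # Row holding the header value of every column

# keep a row if at least one column differs from its header value,
# so only exact header rows are dropped
cond = None
for c, h in zip(df1.columns, first_row):
    clause = col(c) != h
    cond = clause if cond is None else (cond | clause)

final_df = df1.filter(cond)
final_df.show()

Building the condition from df1.columns keeps this working no matter how many columns the file grows to.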