
I have the following df and I want to drop every column that contains the string https in any of its values

df = spark.createDataFrame([
    ('https:john', 'john', 1.1, 'httpsasd'), 
    ('https:john', 'john', 1.2, 'httpsasd')
], ['website', 'name', 'value', 'other']
)

I have found an answer, PySpark drop columns based on column names / String condition, but it matches on the column names rather than on the strings inside the columns' values.

What I am looking for is an output as follows:

name       | value
--------------------
john       | 1.1
john       | 1.2
1 Answer


Use a conditional count for each column to check whether it contains at least one row whose value matches https, then drop the columns with count > 0:

from pyspark.sql import functions as F

# For each column, count the rows whose value matches the pattern "https"
cols_https_count = df.select(*[
    F.count(F.when(F.col(c).rlike("https"), 1)).alias(c)
    for c in df.columns
]).collect()[0].asDict()

# Any column with at least one match gets dropped
cols_to_drop = [k for k, v in cols_https_count.items() if v > 0]

df = df.drop(*cols_to_drop)

df.show()
#+----+-----+
#|name|value|
#+----+-----+
#|john|  1.1|
#|john|  1.2|
#+----+-----+
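
If you only need a literal substring match rather than a regex, a minimal variation (a sketch, not part of the answer above) is to use Column.contains and to scan only the string-typed columns, since a numeric column cannot contain https; string_cols and counts are just illustrative names:

from pyspark.sql import functions as F

# Sketch: restrict the scan to string columns and match the
# literal substring "https" instead of a regular expression
string_cols = [c for c, t in df.dtypes if t == "string"]

counts = df.select(*[
    F.count(F.when(F.col(c).contains("https"), 1)).alias(c)
    for c in string_cols
]).first().asDict()

df = df.drop(*[c for c, v in counts.items() if v > 0])

This avoids relying on the implicit cast that rlike performs on the numeric value column, at the cost of missing matches in non-string columns.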