
I have the following df and I want to drop every column that contains the string https in any of its values

df = spark.createDataFrame([
    ('https:john', 'john', 1.1, 'httpsasd'), 
    ('https:john', 'john', 1.2, 'httpsasd')
], ['website', 'name', 'value', 'other']
)

I have found an answer, PySpark drop columns based on column names / String condition, but it matches on the column names rather than on the strings inside the columns' values.

What I am looking for is an output as follows:

name       | value
--------------------
john       | 1.1
john       | 1.2
1 Answer


Use a conditional count for each column to check whether it contains at least one row whose value matches https, then drop the columns with count > 0:

from pyspark.sql import functions as F

# For each column, count the rows whose value matches the pattern "https"
cols_https_count = df.select(*[
    F.count(F.when(F.col(c).rlike("https"), 1)).alias(c)
    for c in df.columns
]).collect()[0].asDict()

# Any column with at least one match gets dropped
cols_to_drop = [k for k, v in cols_https_count.items() if v > 0]

df = df.drop(*cols_to_drop)

df.show()
#+----+-----+
#|name|value|
#+----+-----+
#|john|  1.1|
#|john|  1.2|
#+----+-----+
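
If you only need a literal substring match rather than a regex, a minimal variation (a sketch, not part of the answer above) is to use Column.contains and to scan only the string-typed columns, since a numeric column cannot contain https; string_cols and counts are just illustrative names:

from pyspark.sql import functions as F

# Sketch: restrict the scan to string columns and match the
# literal substring "https" instead of a regular expression
string_cols = [c for c, t in df.dtypes if t == "string"]

counts = df.select(*[
    F.count(F.when(F.col(c).contains("https"), 1)).alias(c)
    for c in string_cols
]).first().asDict()

df = df.drop(*[c for c, v in counts.items() if v > 0])

This avoids relying on the implicit cast that rlike performs on the numeric value column, at the cost of missing matches in non-string columns.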