In PySpark version 2.1.0, you can drop multiple columns with drop by passing it the names of the columns you want to drop (if you have them in a list of strings, unpack the list, as shown below). See the documentation: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html?highlight=drop#pyspark.sql.DataFrame.drop
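For instance, a minimal sketch (the column names here are hypothetical):
df1.drop('test_score', 'id_1')  # drops both columns in a single call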
In your case, you may create a list containing the names of the columns you want to drop. For example:
# colunas is assumed to be the list of your column names (e.g. df1.columns)
cols_to_drop = [x for x in colunas if (x.startswith('test') or x.startswith('id_1') or x.startswith('vehicle'))]
Then apply drop, unpacking the list:
df2 = df1.drop(*cols_to_drop)  # drop returns a new DataFrame
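For reference, here is a self-contained sketch of the whole pattern (the DataFrame and column names below are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy DataFrame whose column names match the prefixes used above
df1 = spark.createDataFrame(
    [(1, 7.5, 'car', 'Alice')],
    ['id_1', 'test_score', 'vehicle_type', 'name']
)

cols_to_drop = [x for x in df1.columns
                if x.startswith('test') or x.startswith('id_1') or x.startswith('vehicle')]

df2 = df1.drop(*cols_to_drop)
print(df2.columns)  # ['name']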
Alternatively, you can achieve the same result with select. For example:
# Define the columns you want to keep
cols_to_keep = [x for x in df1.columns if x not in cols_to_drop]
# Create a new DataFrame, df2, that keeps only the desired columns from df1
df2 = df1.select(cols_to_keep)
Note that with select you don't need to unpack the list: select accepts either a list of column names or the names as separate arguments.
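Both call styles below are equivalent:

df2 = df1.select(cols_to_keep)   # pass the list as-is
df2 = df1.select(*cols_to_keep)  # or unpack it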
Note that this question also addresses a similar issue.
I hope this helps.