In PySpark version 2.1.0, you can drop multiple columns with drop by passing it the names of the columns you want to drop (if you have them in a list of strings, unpack the list, as shown below). See the documentation: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html?highlight=drop#pyspark.sql.DataFrame.drop
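For instance, a minimal sketch (the column names here are hypothetical):
df1.drop('test_score', 'id_1')  # drops both columns in a single call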
In your case, you may create a list containing the names of the columns you want to drop. For example:
# colunas is assumed to be the list of your column names (e.g. df1.columns)
cols_to_drop = [x for x in colunas if (x.startswith('test') or x.startswith('id_1') or x.startswith('vehicle'))]
Then apply drop, unpacking the list:
df2 = df1.drop(*cols_to_drop)  # drop returns a new DataFrame
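For reference, here is a self-contained sketch of the whole pattern (the DataFrame and column names below are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy DataFrame whose column names match the prefixes used above
df1 = spark.createDataFrame(
    [(1, 7.5, 'car', 'Alice')],
    ['id_1', 'test_score', 'vehicle_type', 'name']
)

cols_to_drop = [x for x in df1.columns
                if x.startswith('test') or x.startswith('id_1') or x.startswith('vehicle')]

df2 = df1.drop(*cols_to_drop)
print(df2.columns)  # ['name']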
Alternatively, you can achieve the same result with select. For example:
# Define the columns you want to keep
cols_to_keep = [x for x in df1.columns if x not in cols_to_drop]
# Create a new DataFrame, df2, that keeps only the desired columns from df1
df2 = df1.select(cols_to_keep)
Note that with select you don't need to unpack the list: select accepts either a list of column names or the names as separate arguments.
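Both call styles below are equivalent:

df2 = df1.select(cols_to_keep)   # pass the list as-is
df2 = df1.select(*cols_to_keep)  # or unpack it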
Note that this question also addresses a similar issue.
I hope this helps.