152
>>> a
DataFrame[id: bigint, julian_date: string, user_id: bigint]
>>> b
DataFrame[id: bigint, quan_created_money: decimal(10,0), quan_created_cnt: bigint]
>>> a.join(b, a.id==b.id, 'outer')
DataFrame[id: bigint, julian_date: string, user_id: bigint, id: bigint, quan_created_money: decimal(10,0), quan_created_cnt: bigint]

There are two id: bigint columns and I want to delete one. How can I do that?

xjx0524

9 Answers

187

Reading the Spark documentation, I found an easier solution.

Since Spark version 1.4 there is a function drop(col) which can be used in PySpark on a DataFrame.

You can use it in two ways:

  1. df.drop('age')
  2. df.drop(df.age)

Pyspark Documentation - Drop
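
In the situation from the question, where the join result has two id columns, you can also pass a Column reference instead of a name so drop knows which one to remove. A minimal sketch, assuming the DataFrames a and b from the question:

joined = a.join(b, a.id == b.id, 'outer')
result = joined.drop(b.id)   # referencing b.id removes b's copy and keeps a's id
result.printSchema()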

Patrick C.
  • when the data size is large, collect() might cause heap space error. you can also create a new dataframe dropping the extra field by `ndf = df.drop('age')` – mnis.p Jul 25 '18 at 11:44
  • There is absolutely no reason to use `collect` for this operation so I removed it from this answer – qwr Oct 31 '21 at 03:54
154

Adding to @Patrick's answer, you can use the following to drop multiple columns:

columns_to_drop = ['id', 'id_copy']
df = df.drop(*columns_to_drop)
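
A small self-contained usage sketch (the SparkSession and sample data are made up for illustration); the * unpacks the list so each name is passed to drop as a separate argument:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 1, "a"), (2, 2, "b")], ["id", "id_copy", "value"])

columns_to_drop = ['id', 'id_copy']
df = df.drop(*columns_to_drop)   # equivalent to df.drop('id', 'id_copy')
df.show()                        # only the "value" column remains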
Clock Slave
36

An easy way to do this is to use "select" together with the fact that you can get a list of all column names for the DataFrame, df, with df.columns:

drop_list = ['a column', 'another column', ...]

df.select([column for column in df.columns if column not in drop_list])
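
For a runnable version of the same idea (column names borrowed from the question, data made up), remember that select returns a new DataFrame, so assign the result if you want to keep it:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "2457130", 100)], ["id", "julian_date", "user_id"])

drop_list = ['julian_date', 'user_id']
df = df.select([column for column in df.columns if column not in drop_list])
df.printSchema()   # only "id" remains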
ev.per.baryon
    Thank-you, this works great for me for removing duplicate columns with the same name as another column, where I use `df.select([df.columns[column_num] for column_num in range(len(df.columns)) if column_num!=2])`, where the column I want to remove has index 2. – Shane Halloran Nov 07 '17 at 14:19
24

You can do this in two ways:

1: You just keep the necessary columns:

drop_column_list = ["drop_column"]
df = df.select([column for column in df.columns if column not in drop_column_list])  

2: This is the more elegant way.

df = df.drop("col_name")

You should avoid the collect() version, because it sends the complete dataset to the driver, which takes a lot of memory and computing effort!

Aron Asztalos
14

You could either explicitly name the columns you want to keep, like so:

keep = [a.id, a.julian_date, a.user_id, b.quan_created_money, b.quan_created_cnt]

Or, in a more general approach, you'd include all columns except a specific one via a list comprehension, for example like this (excluding the id column from b):

keep = [a[c] for c in a.columns] + [b[c] for c in b.columns if c != 'id']

Finally you make a selection on your join result:

d = a.join(b, a.id==b.id, 'outer').select(*keep)
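
For completeness, a minimal runnable sketch of this approach with made-up stand-ins for a and b (schemas simplified from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
a = spark.createDataFrame([(1, "2457130", 10)], ["id", "julian_date", "user_id"])
b = spark.createDataFrame([(1, 5, 2)], ["id", "quan_created_money", "quan_created_cnt"])

# keep all of a's columns, and b's columns except its duplicate id
keep = [a[c] for c in a.columns] + [b[c] for c in b.columns if c != 'id']
d = a.join(b, a.id == b.id, 'outer').select(*keep)
d.printSchema()   # only one id column remains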
karlson
  • I think I got the answer. Select needs to take a list of strings NOT a list of columns. So do this: `keep = [c for c in a.columns] + [c for c in b.columns if c != 'id']` `d = a.join(b, a.id==b.id, 'outer').select(*keep)` – deusxmach1na Apr 17 '15 at 23:02
  • Well, that should do exactly the same thing as my answer, as I'm pretty sure that `select` accepts either strings OR columns (https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame). Btw, in your line `keep = ...` there's no need to use a list comprehension for `a`: `a.columns + [c for c in b.columns if c != 'id']` should achieve the exact same thing, as `a.columns` is already a `list` of strings. – karlson Apr 18 '15 at 09:23
  • @deusxmach1na Actually the column selection based on strings cannot work for the OP, because that would not solve the ambiguity of the `id` column. In that case you have to use the `Column` instances in `select`. – karlson Apr 20 '15 at 08:46
  • All good points. I tried your solution in Spark 1.3 and got errors, so what I posted actually worked for me. And to resolve the id ambiguity I renamed my id column before the join then dropped it after the join using the keep list. HTH anyone else that was stuck like I was. – deusxmach1na Apr 21 '15 at 23:17
4

Maybe a little bit off topic, but here is the solution using Scala. Make an Array of column names from your oldDataFrame and delete the columns that you want to drop ("colExclude"). Then pass the Array[Column] to select and unpack it.

import org.apache.spark.sql.{Column, DataFrame}

val columnsToKeep: Array[Column] = oldDataFrame.columns.diff(Array("colExclude"))
                                               .map(x => oldDataFrame.col(x))
val newDataFrame: DataFrame = oldDataFrame.select(columnsToKeep: _*)
Yuri Brovman
2

Yes, it is possible to drop/select columns by slicing like this:

slice = data.columns[a:b]

data.select(slice).show()

Example:

newDF = spark.createDataFrame([
    (1, "a", "4", 0),
    (2, "b", "10", 3),
    (7, "b", "4", 1),
    (7, "d", "4", 9)],
    ("id", "x1", "x2", "y"))


slice = newDF.columns[1:3]
newDF.select(slice).show()

Use the select method to get the feature columns:

features = newDF.columns[:-1]
newDF.select(features).show()

Use the drop method to get the last column:

last_col = newDF.drop(*features)
last_col.show()
kyramichel
-1

Consider two DataFrames:

>>> aDF.show()
+---+----+
| id|datA|
+---+----+
|  1|  a1|
|  2|  a2|
|  3|  a3|
+---+----+

and

>>> bDF.show()
+---+----+
| id|datB|
+---+----+
|  2|  b2|
|  3|  b3|
|  4|  b4|
+---+----+

To accomplish what you are looking for, there are 2 ways:

1. Use a different join condition. Instead of this:

aDF.join(bDF, aDF.id == bDF.id, "outer")

write this:

aDF.join(bDF, "id", "outer").show()
+---+----+----+
| id|datA|datB|
+---+----+----+
|  1|  a1|null|
|  3|  a3|  b3|
|  2|  a2|  b2|
|  4|null|  b4|
+---+----+----+

This will automatically get rid of the extra id column, so no separate dropping step is needed.

2. Use aliasing: with this you will lose the id values for rows that exist only in bDF (they show up as null).

>>> from pyspark.sql.functions import col
>>> aDF.alias("a").join(bDF.alias("b"), aDF.id == bDF.id, "outer").drop(col("b.id")).show()

+----+----+----+
|  id|datA|datB|
+----+----+----+
|   1|  a1|null|
|   3|  a3|  b3|
|   2|  a2|  b2|
|null|null|  b4|
+----+----+----+
New Coder
-2

You can delete a column like this:

df.drop("column Name").columns

In your case:

df.drop("id").columns

If you want to drop more than one column you can do:

dfWithLongColName.drop("ORIGIN_COUNTRY_NAME", "DEST_COUNTRY_NAME")
techgeek