
I have a spark dataframe and want to drop only the last column.

I tried

df.drop(df.columns.last)

but got the error `AttributeError: 'list' object has no attribute 'last'`.

I also tried:

df = df.drop(df.columns[-1])

but this dropped all columns that have the same name as the last column.

Using Spark 2.4

Joe
  • It is better to drop by name. `withColumn` can alter the order of the columns – Salim Jan 23 '20 at 16:46
  • @Joe I would recommend the following: 1) Save the column names to a list: `colnames = df.columns` 2) rename the columns so the names are unique: `df = df.toDF(*map(str, range(len(colnames))))` 3) drop the last column `df = df.drop(df.columns[-1])` 4) rename the columns back to the original: `df = df.toDF(*colnames[:-1])`. Ping me if the question is reopened and I will post an answer. – pault Jan 23 '20 at 18:26
  • 1
    @pault I reopened the question – Ram Ghadiyaram Jan 23 '20 at 19:32

2 Answers


Here is an approach you can take to drop any column by index.

Suppose you had the following DataFrame:

import numpy as np

np.random.seed(1)
data = np.random.randint(0, 10, size=(3, 3))

df = spark.createDataFrame(data.astype(int).tolist(), ["a", "b", "a"])
df.show()
#+---+---+---+
#|  a|  b|  a|
#+---+---+---+
#|  5|  8|  9|
#|  5|  0|  0|
#|  1|  7|  6|
#+---+---+---+

First save the original column names.

colnames = df.columns
print(colnames)
#['a', 'b', 'a']

Then rename all of the columns in the DataFrame using range so the new column names are unique (they will simply be the column index).

df = df.toDF(*map(str, range(len(colnames))))
print(df.columns)
#['0', '1', '2']

Now drop the last column and rename the columns using the saved column names from the first step (excluding the last column).

df = df.drop(df.columns[-1]).toDF(*colnames[:-1])
df.show()
#+---+---+
#|  a|  b|
#+---+---+
#|  5|  8|
#|  5|  0|
#|  1|  7|
#+---+---+

You can easily expand this to any index, since we renamed using range.
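For example, the same rename-drop-rename trick generalizes to a small helper (the function name is my own, not part of the answer above) that drops the column at any positional index:

```python
# Hypothetical helper (name is my own) that drops the column at position i,
# even when column names are duplicated.
def drop_column_by_index(df, i):
    colnames = df.columns
    i = i % len(colnames)                           # allow negative indexes
    tmp = df.toDF(*map(str, range(len(colnames))))  # unique temporary names
    kept = colnames[:i] + colnames[i + 1:]          # original names to restore
    return tmp.drop(str(i)).toDF(*kept)
```

With the DataFrame above, `drop_column_by_index(df, 0)` would drop the first `a` column while leaving the second one intact.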


I broke it up into steps for explanation purposes, but you can also do this more compactly as follows:

colnames = df.columns
df = df.toDF(*map(str, range(len(colnames))))\
    .drop(str(len(colnames)-1))\
    .toDF(*colnames[:-1])
pault

It is better to drop a column by name. Some operations, like `withColumn`, can alter the order of the columns. If a DataFrame has duplicate column names coming out of a join, refer to the column as `dataframe.column_name` instead of the string `"columnName"`, which is ambiguous.

df3 = df1.join(df2, df1.c1 == df2.c1).drop(df2.c1)

In general: `df.drop(df.columnName)`

Salim
  • Column name could be ambiguous since there will be 2 columns with name `c1` here. – Joe Jan 23 '20 at 17:01
  • 1
    if I do `df2.drop('C1')` then it's ambiguous. But if I do `df2.drop(df2.C1)` then it is not. Please try. – Salim Jan 23 '20 at 17:05
  • I will need example and spark version to reproduce. – Salim Jan 28 '20 at 15:25
  • If the df has multiple columns with the same name and are thus ambiguous, this doesn't work. This is a circumstantial answer that works after this particular join. – Jomonsugi Sep 23 '22 at 20:48