I am using pyspark 2.4.0 and I have a dataframe with the columns below:
a,b,b
0,1,1.0
1,2,2.0
Without doing any join, I have to keep only one of the b columns and remove the other.
How can I achieve this?
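For reference, the dataframe can be reproduced like this (using a local SparkSession):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# build the sample dataframe with the duplicated column name
df = spark.createDataFrame([(0, 1, 1.0), (1, 2, 2.0)]).toDF("a", "b", "b")
df.printSchema()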
Perhaps this is helpful -
val df = Seq((0, 1, 1.0), (1, 2, 2.0)).toDF("a", "b", "b")
df.show(false)
df.printSchema()
/**
* +---+---+---+
* |a |b |b |
* +---+---+---+
* |0 |1 |1.0|
* |1 |2 |2.0|
* +---+---+---+
*
* root
* |-- a: integer (nullable = false)
* |-- b: integer (nullable = false)
* |-- b: double (nullable = false)
*/
df.toDF("a", "b", "b2").drop("b2").show(false)
/**
* +---+---+
* |a |b |
* +---+---+
* |0 |1 |
* |1 |2 |
* +---+---+
*/
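The same rename-then-drop idea works from PySpark as well; a minimal sketch, assuming df is the dataframe above:
# give the second "b" a unique name positionally, then drop it
df.toDF("a", "b", "b2").drop("b2").show(truncate=False)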
I have been in the same situation when I made a join. The good practice is to rename the columns before joining the tables; you can refer to this link:
Spark Dataframe distinguish columns with duplicated name
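For illustration, a small sketch of that practice; the dataframe names df1, df2 and the join key id here are hypothetical:
# rename the clashing column on one side before joining so the result has unique names
right = df2.withColumnRenamed("b", "b_right")
joined = df1.join(right, on="id")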
Selecting one column out of two columns with the same name is confusing, so the good way to do it is not to have columns with the same name in one dataframe.
Try this:
col_select = list(set(df.columns))
df_fin = df.select(col_select)
This may help you.
Convert your DataFrame into an RDD, extract the fields you want, and convert it back into a DataFrame:
from pyspark.sql import Row

# keep only the first two fields (a and the first b) while rebuilding each row
rdd = df.rdd.map(lambda l: Row(a=l[0], b=l[1]))
required_df = spark.createDataFrame(rdd)
required_df.show()
+---+---+
| a| b|
+---+---+
| 0| 1|
| 1| 2|
+---+---+
As a supplement to Som's answer, to automatically rename multiple duplicated columns using cumcount:
import pandas as pd

ls_old = df.columns
pandas_df = pd.DataFrame({'ls_old': ls_old, 'value': 0})
# suffix each repeated name with its per-name cumulative count ('' for the first occurrence)
pandas_df['ls_new'] = pandas_df['ls_old'] + ['' if i == '0' else i for i in pandas_df.groupby(['ls_old']).cumcount().astype(str)]
ls_new = list(pandas_df['ls_new'])
ls_keep = list(set(ls_new).intersection(ls_old))
df2 = df.toDF(*ls_new).select(ls_keep)
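With the example dataframe, ls_new becomes ['a', 'b', 'b1'], so the second b column is renamed to b1 and dropped by the final select, leaving df2 with columns a and b.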