I am using pyspark 2.4.0 and I have a dataframe with the columns below:
a,b,b
0,1,1.0
1,2,2.0
Without doing any join, I have to keep only one of the b columns and remove the other.
How can I achieve this?
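For reference, the dataframe can be reproduced like this (using a local SparkSession):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# build the sample dataframe with the duplicated column name
df = spark.createDataFrame([(0, 1, 1.0), (1, 2, 2.0)]).toDF("a", "b", "b")
df.printSchema()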
Perhaps this is helpful -
val df = Seq((0, 1, 1.0), (1, 2, 2.0)).toDF("a", "b", "b")
df.show(false)
df.printSchema()
/**
* +---+---+---+
* |a |b |b |
* +---+---+---+
* |0 |1 |1.0|
* |1 |2 |2.0|
* +---+---+---+
*
* root
* |-- a: integer (nullable = false)
* |-- b: integer (nullable = false)
* |-- b: double (nullable = false)
*/
df.toDF("a", "b", "b2").drop("b2").show(false)
/**
* +---+---+
* |a |b |
* +---+---+
* |0 |1 |
* |1 |2 |
* +---+---+
*/
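The same rename-then-drop idea works from PySpark as well; a minimal sketch, assuming df is the dataframe above:
# give the second "b" a unique name positionally, then drop it
df.toDF("a", "b", "b2").drop("b2").show(truncate=False)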
I have been in the same situation when I made a join. The good practice is to rename the columns before joining the tables; you can refer to this link:
Spark Dataframe distinguish columns with duplicated name
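For illustration, a small sketch of that practice; the dataframe names df1, df2 and the join key id here are hypothetical:
# rename the clashing column on one side before joining so the result has unique names
right = df2.withColumnRenamed("b", "b_right")
joined = df1.join(right, on="id")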
Selecting one column out of two columns with the same name is confusing, so the good way to do it is not to have columns with the same name in one dataframe.
Try this:
col_select = list(set(df.columns))
df_fin = df.select(col_select)
This may help you.
Convert your DataFrame into an RDD, extract the fields you want, and convert it back into a DataFrame:
from pyspark.sql import Row

# keep only the first two fields (a and the first b) while rebuilding each row
rdd = df.rdd.map(lambda l: Row(a=l[0], b=l[1]))
required_df = spark.createDataFrame(rdd)
required_df.show()
+---+---+
| a| b|
+---+---+
| 0| 1|
| 1| 2|
+---+---+
As a supplement to Som's answer, to automatically rename multiple duplicated columns using cumcount:
import pandas as pd

ls_old = df.columns
pandas_df = pd.DataFrame({'ls_old': ls_old, 'value': 0})
# suffix each repeated name with its per-name cumulative count ('' for the first occurrence)
pandas_df['ls_new'] = pandas_df['ls_old'] + ['' if i == '0' else i for i in pandas_df.groupby(['ls_old']).cumcount().astype(str)]
ls_new = list(pandas_df['ls_new'])
ls_keep = list(set(ls_new).intersection(ls_old))
df2 = df.toDF(*ls_new).select(ls_keep)
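With the example dataframe, ls_new becomes ['a', 'b', 'b1'], so the second b column is renamed to b1 and dropped by the final select, leaving df2 with columns a and b.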