
In Scala I can simply duplicate a column in a DF like this:

val df = spark.read.format("csv")
  .option("sep", ",")
  .option("inferSchema", "true")
  .option("header", "true")
  .option("samplingRatio", "1.0")
  .load("/FileStore/tables/diabetesPIMA.dat")

df.show(false)
val df2 = df.withColumn("age2", $"age")
df2.show()

How do I do this simple copy in pyspark using withColumn?

Nothing seems to work, and the solutions from other posts do not work either on Databricks. Odd, I must be missing something.

Error message:

org.apache.spark.sql.AnalysisException: cannot resolve '`age`' given input columns: [ glucose, pregnancies,  insulin,  outcome,  BMI,  age,  diabetesPF,  skinThickness,  bloodPressure];;

In pyspark (as per the answer below, which I already tried):

from pyspark.sql import functions as F

df = df.withColumn('age2', F.col('age'))
df.show()

which looks very similar to:

df = df.withColumn('col3', F.col('col2'))
thebluephantom

2 Answers


It looks like you have an extra space in the column name: instead of age the column is actually named ' age', with a leading space.

Please check the schema and reference the column name exactly as it appears there, as below:

df = df.withColumn('age2', F.col(' age'))
df.show()
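To verify the extra space, you can print the raw column names and the schema first; a minimal sketch, assuming the DataFrame is the one loaded in the question:

print(df.columns)   # leading/trailing spaces in the headers show up here, e.g. ' age'
df.printSchema()    # the schema prints the same raw names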

Better yet, look at the ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace read options to skip the leading and trailing spaces.

koiralo

To save yourself some headache, you can specify ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace when reading the csv file, which will strip all leading/trailing spaces from the headers and the contents.

e.g.

df = spark.read.csv(
    'file.csv',
    header=True,
    inferSchema=True,
    ignoreLeadingWhiteSpace=True,
    ignoreTrailingWhiteSpace=True
)
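With the whitespace stripped at read time, the withColumn call from the question should then resolve with the plain column name; a quick sketch:

from pyspark.sql import functions as F

df = df.withColumn('age2', F.col('age'))   # 'age' now resolves without a leading space
df.show()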
mck