
In Scala I can simply duplicate a column in a DF like this:

val df = spark.read.format("csv")
  .option("sep", ",")
  .option("inferSchema", "true")
  .option("header", "true")
  .option("samplingRatio", "1.0")
  .load("/FileStore/tables/diabetesPIMA.dat")

df.show(false)
val df2 = df.withColumn("age2", $"age")
df2.show()

How do I do this simple copy in pyspark using withColumn?

Nothing seems to work, and the solutions from other posts do not work either on Databricks. Odd, I must be missing something.

Error message:

org.apache.spark.sql.AnalysisException: cannot resolve '`age`' given input columns: [ glucose, pregnancies,  insulin,  outcome,  BMI,  age,  diabetesPF,  skinThickness,  bloodPressure];;

In pyspark (as per the answer below, which I already tried):

from pyspark.sql import functions as F

df = df.withColumn('age2', F.col('age'))
df.show()

which looks very similar to:

df = df.withColumn('col3', F.col('col2'))
thebluephantom

2 Answers


It looks like you have an extra space in the column name: instead of age the column is actually named ' age', with a leading space.

Please check the schema and reference the column name exactly as it appears there, as below:

df = df.withColumn('age2', F.col(' age'))
df.show()
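To verify the extra space, you can print the raw column names and the schema first; a minimal sketch, assuming the DataFrame is the one loaded in the question:

print(df.columns)   # leading/trailing spaces in the headers show up here, e.g. ' age'
df.printSchema()    # the schema prints the same raw names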

Better yet, look at the ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace read options to skip the leading and trailing spaces.

koiralo

To save yourself some headache, you can specify ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace when reading the csv file, which will strip all leading/trailing spaces from the headers and the contents.

e.g.

df = spark.read.csv(
    'file.csv',
    header=True,
    inferSchema=True,
    ignoreLeadingWhiteSpace=True,
    ignoreTrailingWhiteSpace=True
)
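With the whitespace stripped at read time, the withColumn call from the question should then resolve with the plain column name; a quick sketch:

from pyspark.sql import functions as F

df = df.withColumn('age2', F.col('age'))   # 'age' now resolves without a leading space
df.show()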
mck