
I'm reading a SQL table in a notebook on Synapse and loading it into a PySpark DataFrame:

df = spark.read.synapsesql("dbtablename")

Unfortunately, some columns have a space in their name, e.g. Job Title.

I tried different methods to rename the columns and remove the space. I can see the new DataFrames with the fixed column names, but when I try to convert to a Polars or pandas DataFrame the operation fails, and the error message refers back to the original column name with the space from the first DataFrame.

I tried the following (rough sketches of each attempt are shown after this list):

  • select with col().alias()
  • selectExpr
  • withColumnRenamed
  • sqlContext.registerDataFrameAsTable(df, "tablename") followed by df_new = sqlContext.sql("select JobTitle from table")
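
For reference, the attempts looked roughly like this. This is only a sketch, using "Job Title" as the example column; the temp-view call stands in for the older registerDataFrameAsTable, and the SQL route needs backticks around the original name plus an alias:

from pyspark.sql.functions import col

# 1. select with an alias
df_alias = df.select(col("Job Title").alias("JobTitle"))

# 2. selectExpr
df_expr = df.selectExpr("`Job Title` as JobTitle")

# 3. withColumnRenamed
df_renamed = df.withColumnRenamed("Job Title", "JobTitle")

# 4. register the DataFrame as a temp view and rename through SQL
df.createOrReplaceTempView("tablename")
df_sql = spark.sql("select `Job Title` as JobTitle from tablename")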

Thank you for your kind help.

datatalian
    withColumnRenamed should work. Can you provide the exact code you are using to rename columns using withColumnRenamed? – Saideep Arikontham Mar 25 '23 at 04:55
  • Does this answer your question? [How to change dataframe column names in PySpark?](https://stackoverflow.com/questions/34077353/how-to-change-dataframe-column-names-in-pyspark) – Koedlt Mar 26 '23 at 11:21
  • This is the code I'm using: `df = spark.read.synapsesql("db.tablename")`, then `df_nospace = df.withColumnRenamed("Job Title", "JobTitle")`, then `pandas_df = df_nospace.toPandas()`. The last command results in an error referring to the "Job Title" column, with the space. When I print the schema of df_nospace, "JobTitle" is the column name. Note that if I simply use df.withColumnRenamed("Job Title", "JobTitle") without reassigning, the schema still shows "Job Title". I tried to delete and unpersist the first dataframe too, but with no success. Thank you. – datatalian Mar 27 '23 at 14:28
  • The code you have used should work. Please provide additional details, or try recreating the resource and running it again. – Saideep Arikontham Mar 28 '23 at 03:37

2 Answers


I tried the above in my environment and withColumnRenamed is working fine for me.

My Code:

df2 = df2.withColumnRenamed("first name", "firstname")
pandas_df = df2.toPandas()

(screenshot: the resulting pandas DataFrame showing the renamed firstname column)

You can see the column name changed in the pandas DataFrame as well. Recheck from your end by creating a new Spark pool or a new resource, as suggested in the comments.

> I tried to delete and unpersist the first dataframe too but with no success.

In Spark, DataFrames are immutable, so withColumnRenamed returns a new DataFrame instead of modifying the original. Reassign the result back to the variable to keep the renamed version: df2 = df2.withColumnRenamed("first name", "firstname")
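
A minimal sketch of that behaviour, using a toy DataFrame in place of the real table (the spark session is the one already available in the notebook):

# Toy DataFrame with a space in a column name
df2 = spark.createDataFrame([("Ada", 1)], ["first name", "id"])

# Calling withColumnRenamed without reassigning does NOT change df2
df2.withColumnRenamed("first name", "firstname")
df2.printSchema()   # still shows `first name`

# Reassigning keeps the renamed DataFrame
df2 = df2.withColumnRenamed("first name", "firstname")
df2.printSchema()            # now shows `firstname`
pandas_df = df2.toPandas()   # the conversion sees the renamed column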

Rakesh Govindula

I was able to solve this problem by modifying the way I query the table on the dedicated SQL pool.

Before:

df = spark.read.synapsesql("db.tablename")

After:

from com.microsoft.spark.sqlanalytics.Constants import Constants  # Constants for the Synapse Dedicated SQL Pool connector

df = (spark.read
    .option(Constants.SERVER, "server_name.sql.azuresynapse.net")
    .option(Constants.DATABASE, "db_name")
    .synapsesql("select [Job Title] as JobTitle from dbo.TableName")
    )

pandas_df = df.toPandas()

Thank you for your help. I tried again with the previously shared code and with a new pool and new resources, but I still had the same issue. This solution lets me change the column name.

Note that:

select [Job Title] from dbo.TableName

doesn't work when converting to a pandas DataFrame, or even when simply calling:

df.show()

So the column alias without a space solves it.
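
If many columns contain spaces, the same idea can be generalised. This is only a sketch under the assumptions above (same placeholder server, database, and table names), reusing the plain read, which succeeded, just to discover the column names and then aliasing every column in the query:

# Discover the column names from the plain read (reading worked; only toPandas failed)
cols = spark.read.synapsesql("db.tablename").columns

# Build "[Job Title] as JobTitle, ..." by stripping spaces from each name
select_list = ", ".join(f"[{c}] as {c.replace(' ', '')}" for c in cols)

df = (spark.read
    .option(Constants.SERVER, "server_name.sql.azuresynapse.net")
    .option(Constants.DATABASE, "db_name")
    .synapsesql(f"select {select_list} from dbo.TableName")
    )

pandas_df = df.toPandas()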

Thank you.

datatalian