How to specify column types when using spark.write.saveAsTable(tablename) from .csv in pyspark

Question

I am trying to save a new table from a CSV file. Unfortunately, because of the way the CSV is read and saved, all column types end up as string. The dataset contains other types, and I want to specify them when creating the table.

I have found a solution that alters the column types after creating the table, but it doesn't seem practical.

This is how I create the table:

from pyspark.sql import DataFrame

import_path = f"{st_raw}/data.csv"

sparkDF = spark.read.csv(import_path, header=True)

spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}")

tablename = f"{catalog}.{schema}.{table}"
sparkDF.write.saveAsTable(tablename)

assert spark.table(tablename).count() > 0

display(spark.table(tablename))

Printing the schema shows that all columns are of type string:

 |-- Date: string (nullable = true)
 |-- Location: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- Temperature: string (nullable = true)
 |-- CO2 Emissions: string (nullable = true)
 |-- Sea Level Rise: string (nullable = true)
 |-- Precipitation: string (nullable = true)
 |-- Humidity: string (nullable = true)
 |-- Wind Speed: string (nullable = true)

I need to specify the correct types. How can I accomplish that?

You can try two things: either set inferSchema to true (https://sparkbyexamples.com/spark/spark-read-csv-file-into-dataframe/#inferschema), or specify the CSV schema explicitly (https://docs.databricks.com/en/_extras/notebooks/source/read-csv-schema.html). – abiratsis, Aug 07 '23 at 14:55

score 1 · Accepted Answer · answered Aug 07 '23 at 10:16

Normally, when reading a CSV, you can use the inferSchema option to infer the types of your columns. As explained here, it is set to false by default, which is why every column comes back as a string. Your initial dataframe should therefore look something like:

sparkDF = spark.read.option("inferSchema",True).csv(import_path, header=True)
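
Inference is a best guess, so it can be worth checking which columns are still strings afterwards. A minimal sketch, assuming the input is the list of (name, type) pairs that `sparkDF.dtypes` returns; the helper name `string_columns` is made up for illustration:

```python
def string_columns(dtypes):
    """Return the names of columns whose Spark type is still 'string'.

    `dtypes` is expected to look like DataFrame.dtypes: a list of
    (column_name, type_name) tuples.
    """
    return [name for name, dtype in dtypes if dtype == "string"]


# Hypothetical usage inside the notebook, after reading with inferSchema:
#   print(string_columns(sparkDF.dtypes))
```

For the dataset in the question, text columns such as Location and Country would be expected to remain in that list, while the numeric columns should drop out of it.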

Michael Viaene

Thanks Michael, that was the solution to my problem. The inferSchema option worked perfectly and assigned the correct datatypes. Thanks. – Pfinnn Aug 07 '23 at 13:23

score 0 · Answer 2 · answered Aug 07 '23 at 10:33

When you read the CSV, you can specify the schema explicitly with the schema function. That link has a full example.
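
As a rough illustration of the explicit-schema approach, the sketch below builds a DDL-style schema string and passes it to the reader's `schema` function. The column types in `COLUMN_TYPES` are assumptions (the question only lists the column names), and the helper names `build_ddl_schema` and `read_csv_with_schema` are made up for this example:

```python
# Assumed types for the columns listed in the question; adjust to the real data.
COLUMN_TYPES = {
    "Date": "DATE",
    "Location": "STRING",
    "Country": "STRING",
    "Temperature": "DOUBLE",
    "CO2 Emissions": "DOUBLE",
    "Sea Level Rise": "DOUBLE",
    "Precipitation": "DOUBLE",
    "Humidity": "DOUBLE",
    "Wind Speed": "DOUBLE",
}


def build_ddl_schema(column_types):
    """Build a DDL schema string; backticks allow column names with spaces."""
    return ", ".join(f"`{name}` {dtype}" for name, dtype in column_types.items())


def read_csv_with_schema(spark, path, column_types=COLUMN_TYPES):
    """Read a CSV with an explicit schema instead of inferring the types."""
    ddl = build_ddl_schema(column_types)
    return spark.read.schema(ddl).csv(path, header=True)


# Hypothetical usage with the names from the question:
#   sparkDF = read_csv_with_schema(spark, import_path)
#   sparkDF.write.saveAsTable(tablename)
```

Compared with inferSchema, an explicit schema avoids the extra pass over the file that inference needs, and the resulting types do not depend on what the data happens to contain.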

Chris
