
I have created a DataFrame in the following way:

from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()

df = spark.read.csv("train.csv", header=True)

The schema for my DataFrame is as follows:

root
 |-- PassengerId: string (nullable = true)
 |-- Survived: string (nullable = true)
 |-- Pclass: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- SibSp: string (nullable = true)
 |-- Parch: string (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: string (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)

How do I change the data types of each column of my DataFrame?

I know that I can specify the schema option in the call to csv(), but I want to change the data types at a later stage.
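For example, something like this at read time (a sketch only; the real schema would list all twelve columns):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("PassengerId", IntegerType(), True),
    StructField("Name", StringType(), True),
    # ... and so on for the remaining columns
])
df = spark.read.csv("train.csv", header=True, schema=schema)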

Some of the columns have missing values. How do Spark DataFrames deal with missing values?

matt

2 Answers


To change the data type you can, for example, do a cast. Consider the iris dataset, where SepalLengthCm is a column of type int. If you want to cast that int to a string, you can do the following:

df.withColumn('SepalLengthCm',df['SepalLengthCm'].cast('string'))

Of course, you can do the opposite, from a string to an int, as in your case. Alternatively, you can access the column with a different syntax:

df.withColumn('SepalLengthCm',df.SepalLengthCm.cast('string'))

Or, after from pyspark.sql.functions import col, you could do the same without referencing the DataFrame directly:

df.withColumn('SepalLengthCm',col('SepalLengthCm').cast('string'))
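Putting this together for your DataFrame, a minimal sketch could cast every column in one select (the target types below are my assumptions; adjust them to your needs):

from pyspark.sql.functions import col

# cast the numeric columns; leave the text columns as strings
df = df.select(
    col('PassengerId').cast('int'),
    col('Survived').cast('int'),
    col('Pclass').cast('int'),
    col('Name'),
    col('Sex'),
    col('Age').cast('double'),
    col('SibSp').cast('int'),
    col('Parch').cast('int'),
    col('Ticket'),
    col('Fare').cast('double'),
    col('Cabin'),
    col('Embarked'),
)

Note that cast() preserves the column name, so the schema keeps the same column names with the new types.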

You can deal with null values using df.na.drop(how='any', thresh=None, subset=None) (or df.dropna()). The doc page explains the parameters; a usage sketch follows the list.

Returns a new DataFrame omitting rows with null values. DataFrame.dropna() and DataFrameNaFunctions.drop() are aliases of each other. Parameters:

  • how – ‘any’ or ‘all’. If ‘any’, drop a row if it contains any nulls. If ‘all’, drop a row only if all its values are null.
  • thresh – int, default None. If specified, drop rows that have less than thresh non-null values. This overwrites the how parameter.
  • subset – optional list of column names to consider.
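For instance, applied to the DataFrame above (the choice of columns and threshold is purely illustrative):

# drop rows where Age or Embarked is null
df_clean = df.na.drop(how='any', subset=['Age', 'Embarked'])

# keep only rows with at least 10 non-null values (out of 12 columns)
df_thresh = df.na.drop(thresh=10)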

You can also choose to assign a specific value when you encounter null values. In that case use df.na.fill(value, subset=None) (or df.fillna()). Here is the doc page; a usage sketch follows the list.

Replace null values, alias for na.fill(). DataFrame.fillna() and DataFrameNaFunctions.fill() are aliases of each other. Parameters:

  • value – int, long, float, string, or dict. Value to replace null values with. If the value is a dict, then subset is ignored and value must be a mapping from column name (string) to replacement value. The replacement value must be an int, long, float, boolean, or string.
  • subset – optional list of column names to consider. Columns specified in subset that do not have matching data type are ignored. For example, if value is a string, and subset contains a non-string column, then the non-string column is simply ignored.
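For instance (the fill values are illustrative; the numeric fill for Age assumes it has already been cast to a numeric type, since, per the docs above, fills on columns of a mismatched type are ignored):

# fill nulls per column via a dict mapping column name to replacement value
df_filled = df.na.fill({'Age': 0, 'Cabin': 'unknown', 'Embarked': 'S'})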
Andrea
    Thank you Andrea. I still don't understand why it is so hard to find this answer. – Slak Oct 13 '19 at 10:28

You can also try this:

# column1 and column2 are placeholder column names
df1 = df.select(df.column1.cast("float"), df.column2.cast("integer"))
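For the DataFrame in the question, that might look like:

df1 = df.select(df.Age.cast("float"), df.Survived.cast("integer"))

Note that select() returns a DataFrame containing only the listed columns; use withColumn() as in the other answer if you want to keep the rest.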
Shantanu Sharma