
I have created a DataFrame in the following way:

from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()

df = spark.read.csv("train.csv", header=True)

The schema for my DataFrame is as follows:

root
 |-- PassengerId: string (nullable = true)
 |-- Survived: string (nullable = true)
 |-- Pclass: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- SibSp: string (nullable = true)
 |-- Parch: string (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: string (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)

How do I change the data types of each column of my DataFrame?

I know that I can specify the schema option in the call to csv(), but I want to change the data types at a later stage.
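For example, something like this at read time (a sketch only; the real schema would list all twelve columns):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("PassengerId", IntegerType(), True),
    StructField("Name", StringType(), True),
    # ... and so on for the remaining columns
])
df = spark.read.csv("train.csv", header=True, schema=schema)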

Some of the columns have missing values. How do Spark DataFrames deal with missing values?

matt

2 Answers


To change the data type you can, for example, do a cast. Consider the iris dataset, where SepalLengthCm is a column of type int. If you want to cast that int to a string, you can do the following:

df.withColumn('SepalLengthCm',df['SepalLengthCm'].cast('string'))

Of course, you can do the opposite, from a string to an int, as in your case. Alternatively, you can access the column with a different syntax:

df.withColumn('SepalLengthCm',df.SepalLengthCm.cast('string'))

Or, after from pyspark.sql.functions import col, you could do the same without referencing the DataFrame directly:

df.withColumn('SepalLengthCm',col('SepalLengthCm').cast('string'))
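Putting this together for your DataFrame, a minimal sketch could cast every column in one select (the target types below are my assumptions; adjust them to your needs):

from pyspark.sql.functions import col

# cast the numeric columns; leave the text columns as strings
df = df.select(
    col('PassengerId').cast('int'),
    col('Survived').cast('int'),
    col('Pclass').cast('int'),
    col('Name'),
    col('Sex'),
    col('Age').cast('double'),
    col('SibSp').cast('int'),
    col('Parch').cast('int'),
    col('Ticket'),
    col('Fare').cast('double'),
    col('Cabin'),
    col('Embarked'),
)

Note that cast() preserves the column name, so the schema keeps the same column names with the new types.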

You can deal with null values using df.na.drop(how='any', thresh=None, subset=None) (or df.dropna()). The doc page explains the parameters; a usage sketch follows the list.

Returns a new DataFrame omitting rows with null values. DataFrame.dropna() and DataFrameNaFunctions.drop() are aliases of each other. Parameters:

  • how – ‘any’ or ‘all’. If ‘any’, drop a row if it contains any nulls. If ‘all’, drop a row only if all its values are null.
  • thresh – int, default None. If specified, drop rows that have less than thresh non-null values. This overwrites the how parameter.
  • subset – optional list of column names to consider.
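For instance, applied to the DataFrame above (the choice of columns and threshold is purely illustrative):

# drop rows where Age or Embarked is null
df_clean = df.na.drop(how='any', subset=['Age', 'Embarked'])

# keep only rows with at least 10 non-null values (out of 12 columns)
df_thresh = df.na.drop(thresh=10)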

You can also choose to assign a specific value when you encounter null values. In that case use df.na.fill(value, subset=None) (or df.fillna()). Here is the doc page; a usage sketch follows the list.

Replace null values, alias for na.fill(). DataFrame.fillna() and DataFrameNaFunctions.fill() are aliases of each other. Parameters:

  • value – int, long, float, string, or dict. Value to replace null values with. If the value is a dict, then subset is ignored and value must be a mapping from column name (string) to replacement value. The replacement value must be an int, long, float, boolean, or string.
  • subset – optional list of column names to consider. Columns specified in subset that do not have matching data type are ignored. For example, if value is a string, and subset contains a non-string column, then the non-string column is simply ignored.
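For instance (the fill values are illustrative; the numeric fill for Age assumes it has already been cast to a numeric type, since, per the docs above, fills on columns of a mismatched type are ignored):

# fill nulls per column via a dict mapping column name to replacement value
df_filled = df.na.fill({'Age': 0, 'Cabin': 'unknown', 'Embarked': 'S'})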
Andrea
    Thank you Andrea. I still don't understand why it is so hard to find this answer. – Slak Oct 13 '19 at 10:28

You can also try this:

# column1 and column2 are placeholder column names
df1 = df.select(df.column1.cast("float"), df.column2.cast("integer"))
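For the DataFrame in the question, that might look like:

df1 = df.select(df.Age.cast("float"), df.Survived.cast("integer"))

Note that select() returns a DataFrame containing only the listed columns; use withColumn() as in the other answer if you want to keep the rest.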
Shantanu Sharma