To change the datatype you can, for example, use a cast. Consider the iris dataset, where SepalLengthCm is a column of type int. To cast that int to a string, you can do the following:
df.withColumn('SepalLengthCm',df['SepalLengthCm'].cast('string'))
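Of course, you can do the opposite, from a string to an int, which is your case. Note that withColumn returns a new DataFrame rather than modifying df, so assign the result if you want to keep it. You can also access the column with attribute syntax: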
df.withColumn('SepalLengthCm',df.SepalLengthCm.cast('string'))
Or, after importing col (from pyspark.sql.functions import col), you can refer to the column by name without going through df directly:
df.withColumn('SepalLengthCm',col('SepalLengthCm').cast('string'))
You can deal with null values using df.na.drop(how='any', thresh=None, subset=None) (or df.dropna()). The documentation describes the parameters as follows:
Returns a new DataFrame omitting rows with null values. DataFrame.dropna() and DataFrameNaFunctions.drop() are aliases of each other.
Parameters:
- how – ‘any’ or ‘all’. If ‘any’, drop a row if it contains any nulls. If ‘all’, drop a row only if all its values are null.
- thresh – int, default None. If specified, drop rows that have less than thresh non-null values. This overwrites the how parameter.
- subset – optional list of column names to consider.
You can also replace null values with a specific value instead of dropping the rows. For that, use df.na.fill(value, subset=None) (or df.fillna()). The documentation describes it as follows:
Replace null values, alias for na.fill(). DataFrame.fillna() and DataFrameNaFunctions.fill() are aliases of each other.
Parameters:
- value – int, long, float, string, or dict. Value to replace null values with. If the value is a dict, then subset is ignored and value must be a mapping from column name (string) to replacement value. The replacement value must be an int, long, float, boolean, or string.
- subset – optional list of column names to consider. Columns specified in subset that do not have matching data type are ignored. For example, if value is a string, and subset contains a non-string column, then the non-string column is simply ignored.