Convert missing string values to isNull in Pyspark

Question

I am quite new to PySpark. I have a dataset that I load from a CSV (in Glue). One column, code, contains both string and long values.

df = glueContext.create_dynamic_frame.from_catalog(database="db", table_name="data_csv")
df.printSchema()

root
|-- code: choice
|    |-- long
|    |-- string

It seems that PySpark treats the missing values as strings. I found this by flattening the column, as described in "How to unwrap nested Struct column into multiple columns?".

df_flattened.show()

+---------+------+
|     long|string|
+---------+------+
|  9965213|  null|
|300870254|  null|
|  5607653|  null|
|  5798154|  null|
|   389954|  null|
|      572|  null|
|   951091|  null|

I actually want the whole column to be a string, but I could not find how to turn the null values (above) into actual null values that are detected by isnan. Also, when I cast the whole column to string, I find that none of the rows is == 'null'.

from pyspark.sql import functions as f

df = (df
      .toDF()
      .withColumn('code', f.col('code').cast("string"))
     )

df.select('code').where(f.col('code') == 'null').count()

0

What type are these null values and how can I convert them to "true" null values (that are recognized by isNull())?

score 0 · Answer 1 · answered Apr 30 '20 at 12:38

To deal with null values in PySpark, you can filter them with the isnull function or replace them using na. For example:

from pyspark.sql import functions as f
df.select([f.count(f.when(f.isnull(c), c)).alias(c) for c in df.columns])
# This gives the count of null values in each column

If you want to replace null values with some other value, you can use:

df.na.fill(yourValue)

Hope it helps.

answered Apr 30 '20 at 12:38

Shubham Jain


Thanks, but why do I get bigint for all of my columns with the above code? – corianne1234 Apr 30 '20 at 13:12
It's because you are letting Spark infer the schema for you, and there might be some bigint values. You can pass a schema explicitly to overcome this: use StructType to create your own schema and pass it to the df. – Shubham Jain May 01 '20 at 01:53
