I am quite new to PySpark and I have a dataset that I load from a CSV (in Glue). There is one column, code, that contains both string and long values.
df = glueContext.create_dynamic_frame.from_catalog(database="db", table_name="data_csv")
df.printSchema()
root
|-- code: choice
| |-- long
| |-- string
It seems that PySpark sees the missing values as string. I found this by flattening the column as described in "How to unwrap nested Struct column into multiple columns?":
df_flattened.show()
+---------+------+
| long|string|
+---------+------+
| 9965213| null|
|300870254| null|
| 5607653| null|
| 5798154| null|
| 389954| null|
| 572| null|
| 951091| null|
I actually want the whole column to be string, but I could not find how to turn the null values (above) into actual null values that show up when using isnan. Also, when I cast the whole column to string, I find that none of the rows is == 'null':
from pyspark.sql import functions as f

df = (
    df.toDF()  # convert the Glue DynamicFrame to a Spark DataFrame
    .withColumn('code', f.col('code').cast("string"))
)
df.select('code').where(f.col('code') == 'null').count()
0
What type are these null values, and how can I convert them to "true" null values (that are recognized by isNull())?