I have the following data (you can reproduce it by copying and pasting):
from pyspark.sql import Row
l = [Row(value=True), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=None), Row(value=True), Row(value=None), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=None), Row(value=None), Row(value=None), Row(value=None), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=None), Row(value=None), Row(value=True), Row(value=None), Row(value=True), Row(value=None)]
l_df = spark.createDataFrame(l)
Let's take a look at the schema of l_df:
l_df.printSchema()
root
|-- value: boolean (nullable = true)
Now I want to use cube() to count the frequency of each distinct value in the value column:
l_df.cube("value").count().show()
But I see two types of null values!
+-----+-----+
|value|count|
+-----+-----+
| true| 67|
| null| 100|
| null| 33|
+-----+-----+
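For comparison, here is a sketch of the check I would expect to be equivalent (output omitted); a plain groupBy should only produce one row per actual distinct value:
# groupBy should not add any extra aggregation rows on top of the groups
l_df.groupBy("value").count().show()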
To verify that I don't actually have two types of null:
l_df.select("value").distinct().collect()
And there is indeed only one type of null:
[Row(value=None), Row(value=True)]
Just to double check:
l_df.select("value").distinct().count()
And it returns 2.
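Another sanity check I can think of (a sketch relying on Column.isNull(), which should be available in this Spark version) is to count the nulls directly; if there is really only one kind of null, this should line up with the 33 from the cube output:
from pyspark.sql.functions import col
# count rows where value is actually NULL in the data
l_df.filter(col("value").isNull()).count()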
I also noticed that len(l) is 100, and the count on the first null row is equal to this number. Why is this happening?
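Could one of those rows be a subtotal/grand-total row that cube() adds on top of the per-value groups? A sketch of how I would try to tell them apart (assuming pyspark.sql.functions.grouping is available in Spark 2.1):
from pyspark.sql.functions import count, grouping
# grouping("value") should be 1 on rows where "value" was aggregated away
# (i.e. cube's total rows) and 0 on ordinary group rows
l_df.cube("value").agg(count("*").alias("count"), grouping("value").alias("is_aggregated")).show()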
System info: Spark 2.1.0, Python 2.7.8, [GCC 4.1.2 20070626 (Red Hat 4.1.2-14)] on linux2