
I have the following data (you can reproduce it by copying and pasting):

from pyspark.sql import Row
l = [Row(value=True), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=None), Row(value=True), Row(value=None), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=None), Row(value=None), Row(value=None), Row(value=None), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=None), Row(value=None), Row(value=True), Row(value=None), Row(value=True), Row(value=None)]
l_df = spark.createDataFrame(l)

Let's take a look at the schema of l_df:

l_df.printSchema()

root
|-- value: boolean (nullable = true)

Now I want to use cube() to count the frequency of each distinct value in the value column:

l_df.cube("value").count().show()

But I see two types of null values!

+-----+-----+
|value|count|
+-----+-----+
| true|   67|
| null|  100|
| null|   33|
+-----+-----+

To verify that I don't actually have two types of null:

l_df.select("value").distinct().collect()

And there is indeed only one type of null:

[Row(value=None), Row(value=True)]

Just to double check:

l_df.select("value").distinct().count()

And it returns 2.

I also noticed that len(l) is 100, and the count on the first null row is equal to this number.
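A quick sanity check (nothing more than counting the rows of the same l_df) confirms that total:

l_df.count()
# 100

Why is this happening?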

System info: Spark 2.1.0, Python 2.7.8 [GCC 4.1.2 20070626 (Red Hat 4.1.2-14)] on linux2

versatile parsley

2 Answers


These are not two kinds of null but the results of aggregations at different levels. As explained in What is the difference between cube, rollup and groupBy operators?, your cube application is equivalent to:

SELECT NULL AS value, COUNT(*) FROM df
UNION ALL
SELECT value, COUNT(*) FROM df GROUP BY value

The first query generates the tuple (null, 100) (the total number of records), where NULL is just a placeholder, and the second generates the tuples (true, 67) and (null, 33), where NULL is one of the levels of the value column.
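As a minimal sketch of that equivalence in the DataFrame API (using the question's l_df; the explicit union below is only an illustration, not what Spark literally executes):

from pyspark.sql.functions import lit, count

# Grand-total level: "value" here is just a NULL placeholder
total = l_df.agg(count("*").alias("count")) \
            .select(lit(None).cast("boolean").alias("value"), "count")

# Per-value level: here NULL is an actual value of the column
per_value = l_df.groupBy("value").count()

total.union(per_value).show()  # same three rows as l_df.cube("value").count()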

It is easy to check with grouping (or grouping_id):

from pyspark.sql.functions import grouping, count

l_df.cube("value").agg(count("*"), grouping("value")).show()
# +-----+--------+---------------+
# |value|count(1)|grouping(value)|
# +-----+--------+---------------+
# | true|      67|              0|
# | null|     100|              1|
# | null|      33|              0|
# +-----+--------+---------------+
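And if you only want the per-value rows from the cube, one option (just a sketch building on the same grouping column) is to keep only the rows where grouping("value") is 0:

from pyspark.sql.functions import grouping, count

(l_df.cube("value")
     .agg(count("*").alias("count"), grouping("value").alias("g"))
     .filter("g = 0")   # drops the grand-total row, where value is only a placeholder
     .drop("g")
     .show())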
Alper t. Turker

df.groupBy('value').count().show() will do, as @pault said.
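With the question's l_df that looks roughly like this (row order may differ):

l_df.groupBy("value").count().show()
# +-----+-----+
# |value|count|
# +-----+-----+
# | true|   67|
# | null|   33|
# +-----+-----+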

In the case of cube(), adding a filter() call works for me:

df.cube("value").count().filter(  col('count')<df.count()  ).show()

but an extra job gets added (the separate df.count() call). See the screenshot from my work where I used cube().


user8482601