I'm reading Learning PySpark, and in the book the author first creates a DataFrame:
df_miss = spark.createDataFrame([
    (1, 143.5, 5.6, 28, 'M', 100000),
    (2, 167.2, 5.4, 45, 'M', None),
    (3, None, 5.2, None, None, None),
    (4, 144.5, 5.9, 33, 'M', None),
    (5, 133.2, 5.7, 54, 'F', None),
    (6, 124.1, 5.2, None, 'F', None),
    (7, 129.2, 5.3, 42, 'M', 76000),
], ['id', 'weight', 'height', 'age', 'gender', 'income'])
Then he uses this expression to calculate the fraction of missing values in each column:
import pyspark.sql.functions as fn
df_miss.agg(*[(1 - (fn.count(c) / fn.count('*'))).alias(c + '_missing')
              for c in df_miss.columns]).show()
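To show what I think the arithmetic does, here is a rough plain-Python sketch with no Spark involved. It assumes that fn.count(c) counts only the non-null values in column c and that fn.count('*') counts all rows. The helper name missing_fraction is my own, not from the book:

```python
# Plain-Python sketch of the per-column arithmetic; this is not Spark code.
# The rows copy the data from the DataFrame above.
rows = [
    (1, 143.5, 5.6, 28, 'M', 100000),
    (2, 167.2, 5.4, 45, 'M', None),
    (3, None, 5.2, None, None, None),
    (4, 144.5, 5.9, 33, 'M', None),
    (5, 133.2, 5.7, 54, 'F', None),
    (6, 124.1, 5.2, None, 'F', None),
    (7, 129.2, 5.3, 42, 'M', 76000),
]
columns = ['id', 'weight', 'height', 'age', 'gender', 'income']


def missing_fraction(rows, index):
    """Hypothetical helper: fraction of rows whose value at `index` is None."""
    non_null = sum(1 for row in rows if row[index] is not None)  # assumed role of fn.count(c)
    total = len(rows)                                            # assumed role of fn.count('*')
    return 1 - non_null / total


# One entry per column, named the same way as the .alias(c + '_missing') call.
result = {name + '_missing': missing_fraction(rows, i)
          for i, name in enumerate(columns)}
print(result)
```

If that reading is right, income should come out as the column with the largest missing fraction (5 of 7 rows).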
What are the two * for, especially the second one (the '*' inside fn.count)? Are there any resources about this kind of expression? Thanks a lot!