I have a PySpark dataframe where I am finding the min/max values of each column, along with the count of those min/max values. I am able to select the min/max values using:

from pyspark.sql import functions as F
df.select([F.min(F.col(c)).alias(c) for c in df.columns])
I want the count of the min/max values as well, in the same dataframe.
The specific output I need:

...| col_n            | col_m            |...
...| xn               | xm               |...   <- min(col_n), min(col_m)
...| count(col_n==xn) | count(col_m==xm) |...   <- count of rows equal to that min
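
For context, a minimal runnable sketch of the min-selection above (the SparkSession setup and the sample data are my own assumptions, not from the question):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical sample data
df = spark.createDataFrame([(1, 7), (1, 3), (1, 5), (2, 3)], schema=['col_n', 'col_m'])

# one row holding each column's minimum
df.select([F.min(F.col(c)).alias(c) for c in df.columns]).show()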
Can you post your expected output? – Shubham Jain Jul 06 '20 at 10:55
1 Answer
Try this:
from pyspark.sql import functions as F

# sample data (assumes an existing sqlContext, as in older Spark versions)
tst = sqlContext.createDataFrame(
    [(1, 7, 2, 11), (1, 3, 4, 12), (1, 5, 6, 13), (1, 7, 8, 14),
     (2, 9, 10, 15), (2, 11, 12, 16), (2, 13, 14, 17)],
    schema=['col1', 'col2', 'col3', 'col4'])

# per-column max, collapsed to a single row
expr = [F.max(coln).alias(coln + '_max') for coln in tst.columns]
tst_mx = tst.select(*expr)

# bring that single summary row to the driver as a dict, e.g. {'col1_max': 2, ...}
tst_dict = tst_mx.collect()[0].asDict()

# for each column, count the rows that equal its max
expr1 = [F.count(F.when(F.col(coln) == tst_dict[coln + '_max'], F.col(coln))).alias(coln + '_max_count')
         for coln in tst.columns]

tst_res = tst.select(*(expr + expr1))
In expr I have only shown the max function; you can scale this to other functions such as min, mean, etc., and even use a list comprehension over a list of functions. Refer to this answer for such scaling: pyspark: groupby and aggregate avg and first on multiple columns. It is explained there for aggregate, but the same can be done with a select statement.
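
To illustrate that scaling, here is a rough sketch (my own extension, not from the original answer) applying both min and max in one pass; the funcs dict and the name suffixes are assumed choices:

from pyspark.sql import functions as F

# hypothetical: a dict of aggregates to apply to every column
funcs = {'min': F.min, 'max': F.max}

agg_expr = [f(coln).alias(coln + '_' + name)
            for coln in tst.columns
            for name, f in funcs.items()]
# single driver-side dict of all the summary values
stats = tst.select(*agg_expr).collect()[0].asDict()

# count, per column and per aggregate, the rows matching the summary value
count_expr = [F.count(F.when(F.col(coln) == stats[coln + '_' + name], F.col(coln)))
                  .alias(coln + '_' + name + '_count')
              for coln in tst.columns
              for name in funcs]
tst_res = tst.select(*(agg_expr + count_expr))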

This will give the count of the entire column. I want the count of min/max values in the column. – HarshR Jul 06 '20 at 10:26
Thanks for the answer. I was looking for an answer that doesn't involve a collect statement (maybe we can use a subquery, but I don't know how to use one in PySpark). – HarshR Jul 06 '20 at 14:43
OK, not sure about that. Since you are only collecting summary data, memory shouldn't be a big problem with collect. – Raghu Jul 06 '20 at 15:23
Possible answer: https://stackoverflow.com/questions/33882894/spark-sql-apply-aggregate-functions-to-a-list-of-columns – dsk Jul 07 '20 at 15:54
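
Following up on the collect-free request above, one possible approach (a sketch of my own, not from the thread) attaches each column's max to every row via an all-rows window and then aggregates once; Window.partitionBy() with no columns puts the whole dataframe in one partition, which should be acceptable here since only summary data is involved:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# a single partition spanning the whole dataframe (Spark warns about this;
# tolerable here because the result is a tiny summary)
w = Window.partitionBy()

# step 1: attach each column's max to every row
with_max = tst.select('*', *[F.max(c).over(w).alias(c + '_max') for c in tst.columns])

# step 2: one aggregation - keep each max and count the rows that equal it
tst_res = with_max.agg(
    *[F.first(c + '_max').alias(c + '_max') for c in tst.columns],
    *[F.count(F.when(F.col(c) == F.col(c + '_max'), True)).alias(c + '_max_count')
      for c in tst.columns])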