I have the following values in a Spark DataFrame:
+---------+------+----+----------+-----------+-----------+--------------+
|store_nbr|metric|goal|time_frame|time_period|fiscal_year|       channel|
+---------+------+----+----------+-----------+-----------+--------------+
|     1000|   NPS|  53| Half Year|         H1|       2023|         store|
|     1001|   NPS|  81| Half Year|         H1|       2023|ecomm_combined|
|     1003|   NPS|  65| Half Year|         H1|       2023|ecomm_combined|
|     1004|   NPS|  85| Half Year|         H1|       2023|         store|
|     1007|   NPS|  53| Half Year|         H1|       2023|         store|
|     1008|   NPS|null| Half Year|         H1|       2023|         store|
|     1009|   NPS|  85| Half Year|         H1|       2023|         store|
|     1011|   NPS|  72| Half Year|         H1|       2023|         store|
|     1012|   NPS|  71| Half Year|         H1|       2023|ecomm_combined|
|     1013|   NPS|  52| Half Year|         H1|       2023|ecomm_combined|
|     1014|   NPS|null| Half Year|         H1|       2023|ecomm_combined|
|     1016|   NPS|  54| Half Year|         H1|       2023|ecomm_combined|
|     1017|   NPS|  69| Half Year|         H1|       2023|ecomm_combined|
|     1018|   NPS|  93| Half Year|         H1|       2023|ecomm_combined|
|     1020|   NPS|  93| Half Year|         H1|       2023|         store|
|     1022|   NPS|  95| Half Year|         H1|       2023|         store|
|     1023|   NPS|  86| Half Year|         H1|       2023|ecomm_combined|
|     1025|   NPS|  72| Half Year|         H1|       2023|ecomm_combined|
|     1026|   NPS|  70| Half Year|         H1|       2023|ecomm_combined|
|     1027|   NPS|null| Half Year|         H1|       2023|ecomm_combined|
|     1028|   NPS|  63| Half Year|         H1|       2023|ecomm_combined|
|     1029|   NPS|  66| Half Year|         H1|       2023|ecomm_combined|
|     1030|   NPS|  86| Half Year|         H1|       2023|ecomm_combined|
|     1031|   NPS|  61| Half Year|         H1|       2023|ecomm_combined|
|     1032|   NPS|  96| Half Year|         H1|       2023|ecomm_combined|
|     1033|   NPS|  91| Half Year|         H1|       2023|ecomm_combined|
|     1034|   NPS|  79| Half Year|         H1|       2023|ecomm_combined|
|     1035|   NPS|  53| Half Year|         H1|       2023|ecomm_combined|
|     1036|   NPS|null| Half Year|         H1|       2023|         store|
+---------+------+----+----------+-----------+-----------+--------------+
and I calculate the averages like this:

goal = raw_df.groupBy('metric', 'time_frame', 'time_period', 'fiscal_year', 'channel').mean('goal')

which gives:
+------+----------+-----------+-----------+--------------+-----------------+
|metric|time_frame|time_period|fiscal_year|       channel|        avg(goal)|
+------+----------+-----------+-----------+--------------+-----------------+
|  null|      null|       null|       null|          null|             null|
|   NPS| Half Year|         H1|       2023|ecomm_combined|75.24033149171271|
|   NPS| Half Year|         H1|       2023|         store|             78.0|
+------+----------+-----------+-----------+--------------+-----------------+
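
For reference, this is how I imagine joining those averages back onto the raw rows (just a sketch, untested; group_cols, avg_df, avg_goal, and filled_df are names I made up):

from pyspark.sql import functions as F

group_cols = ['metric', 'time_frame', 'time_period', 'fiscal_year', 'channel']

# mean('goal') names its output column avg(goal); rename it so it
# doesn't clash with the existing goal column after the join
avg_df = goal.withColumnRenamed('avg(goal)', 'avg_goal')

# left-join the group averages and fall back to them where goal is null
filled_df = (raw_df
             .join(avg_df, on=group_cols, how='left')
             .withColumn('goal', F.coalesce(F.col('goal'), F.col('avg_goal')))
             .drop('avg_goal'))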
Is a join like the one above the right approach? To restate the goal: I want to fill the null values in raw_df's goal column (the data type doesn't matter) with the average calculated per metric, time_frame, time_period, fiscal_year, and channel group. How can I do this in Spark, or maybe in a Pandas DataFrame?
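
Alternatively, I suspect a window function could do the same thing without an explicit join (again only a sketch, untested):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('metric', 'time_frame', 'time_period', 'fiscal_year', 'channel')

# avg() ignores nulls, so each row sees the mean of its group's non-null goals
filled_df = raw_df.withColumn(
    'goal', F.coalesce(F.col('goal'), F.avg('goal').over(w)))

And if Pandas is easier, I believe the groupby/transform pattern is the equivalent (sketch, untested; pdf is my name for the converted frame):

pdf = raw_df.toPandas()
group_cols = ['metric', 'time_frame', 'time_period', 'fiscal_year', 'channel']

# transform('mean') broadcasts each group's mean back onto its rows;
# fillna aligns by index, so only the null goals get replaced
pdf['goal'] = pdf['goal'].fillna(
    pdf.groupby(group_cols)['goal'].transform('mean'))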