My dataframe looks like this:
+------+-------+-----+----+-----+
|Title | Status|Suite|ID  |Time |
+------+-------+-----+----+-----+
|KIM | Passed|ABC |123 |20 |
|KJT | Passed|ABC |123 |10 |
|ZXD | Passed|CDF |123 |15 |
|XCV | Passed|GHY |113 |36 |
|KJM | Passed|RTH |456 |45 |
|KIM | Passed|ABC |115 |47 |
|JY | Passed|JHJK |8963|74 |
|KJH | Passed|SNMP |256 |47 |
|KJH | Passed|ABC |123 |78 |
|LOK | Passed|GHY |456 |96 |
|LIM | Passed|RTH |113 |78 |
|MKN | Passed|ABC |115 |74 |
|KJM | Passed|GHY |8963|74 |
+------+-------+-----+----+-----+
It can be created using:
df = sqlCtx.createDataFrame(
    [
        ('KIM', 'Passed', 'ABC', '123', 20),
        ('KJT', 'Passed', 'ABC', '123', 10),
        ('ZXD', 'Passed', 'CDF', '123', 15),
        ('XCV', 'Passed', 'GHY', '113', 36),
        ('KJM', 'Passed', 'RTH', '456', 45),
        ('KIM', 'Passed', 'ABC', '115', 47),
        ('JY', 'Passed', 'JHJK', '8963', 74),
        ('KJH', 'Passed', 'SNMP', '256', 47),
        ('KJH', 'Passed', 'ABC', '123', 78),
        ('LOK', 'Passed', 'GHY', '456', 96),
        ('LIM', 'Passed', 'RTH', '113', 78),
        ('MKN', 'Passed', 'ABC', '115', 74),
        ('KJM', 'Passed', 'GHY', '8963', 74),
    ],
    ('Title', 'Status', 'Suite', 'ID', 'Time')
)
I need to group by ID and aggregate Time (taking the mean), and the result should include Title, Status and Suite along with ID.
My expected output is:
+------+-------+-----+----+-----+
|Title | Status|Suite|ID  |Time |
+------+-------+-----+----+-----+
|KIM | Passed|ABC |123 |30.75|
|XCV | Passed|GHY |113 |57 |
|KJM | Passed|RTH |456 |70.5 |
|KIM | Passed|ABC |115 |60.5 |
|JY | Passed|JHJK |8963|74 |
|KJH | Passed|SNMP |256 |47 |
+------+-------+-----+----+-----+
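To make the expected values precise, here is a minimal plain-Python sketch (no Spark involved) that reproduces the table above. It assumes that Title, Status and Suite are taken from the first row seen for each ID and that Time is the arithmetic mean per ID; the helper name mean_time_per_id is made up for this illustration.

```python
# Same rows as the dataframe above, as plain tuples.
rows = [
    ('KIM', 'Passed', 'ABC', '123', 20),
    ('KJT', 'Passed', 'ABC', '123', 10),
    ('ZXD', 'Passed', 'CDF', '123', 15),
    ('XCV', 'Passed', 'GHY', '113', 36),
    ('KJM', 'Passed', 'RTH', '456', 45),
    ('KIM', 'Passed', 'ABC', '115', 47),
    ('JY', 'Passed', 'JHJK', '8963', 74),
    ('KJH', 'Passed', 'SNMP', '256', 47),
    ('KJH', 'Passed', 'ABC', '123', 78),
    ('LOK', 'Passed', 'GHY', '456', 96),
    ('LIM', 'Passed', 'RTH', '113', 78),
    ('MKN', 'Passed', 'ABC', '115', 74),
    ('KJM', 'Passed', 'GHY', '8963', 74),
]


def mean_time_per_id(rows):
    """Group by ID, keep the first Title/Status/Suite seen, average Time."""
    groups = {}  # ID -> (title, status, suite, [times]); dicts keep insertion order
    for title, status, suite, id_, time in rows:
        if id_ not in groups:
            # Assumption: the first row of each ID supplies the other columns.
            groups[id_] = (title, status, suite, [])
        groups[id_][3].append(time)
    return [
        (title, status, suite, id_, sum(times) / len(times))
        for id_, (title, status, suite, times) in groups.items()
    ]


result = mean_time_per_id(rows)
for row in result:
    print(row)
```

For example, ID 123 has Time values 20, 10, 15 and 78, so its mean is 123 / 4 = 30.75, and its first row is KIM / Passed / ABC.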
I have tried the code below, but the result contains only ID and the averaged Time; Title, Status and Suite are dropped.
from pyspark.sql.functions import mean
df.groupBy("ID").agg(mean("Time").alias("Time"))