I have data in the following format:
agent_id, client_id, client_long, client_lat
1, 1, 39.777982, -7.004599
1, 2, 39.677982, -7.094599
1, 3, 39.577982, -7.084599
2, 4, 39.477982, -7.074599
2, 5, 39.377982, -7.064599
I want to get the average distance between the clients of each agent.
That means computing the distance between every pair of clients 1, 2 and 3 for agent 1, and the distance between clients 4 and 5 for agent 2, then averaging those distances per agent.
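To make the expected result concrete, here is a rough plain-Python sketch (not PySpark) of the calculation I have in mind. It assumes great-circle (haversine) distance in kilometres and that the coordinate columns are in the order given in the header (longitude, then latitude). Both are my assumptions, and the function names are only illustrative.

```python
from itertools import combinations
from math import asin, cos, radians, sin, sqrt

# Sample rows: (agent_id, client_id, client_long, client_lat)
rows = [
    (1, 1, 39.777982, -7.004599),
    (1, 2, 39.677982, -7.094599),
    (1, 3, 39.577982, -7.084599),
    (2, 4, 39.477982, -7.074599),
    (2, 5, 39.377982, -7.064599),
]

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; kilometres is an assumed unit


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance between two points given in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def average_client_distance(rows):
    """Return {agent_id: mean pairwise distance between that agent's clients}."""
    # Group client coordinates by agent.
    by_agent = {}
    for agent_id, _client_id, lon, lat in rows:
        by_agent.setdefault(agent_id, []).append((lon, lat))

    result = {}
    for agent_id, points in by_agent.items():
        # Every unordered pair of clients is counted exactly once.
        dists = [haversine_km(*p, *q) for p, q in combinations(points, 2)]
        # An agent with a single client has no pairs, so no average.
        result[agent_id] = sum(dists) / len(dists) if dists else None
    return result


averages = average_client_distance(rows)
for agent_id, avg in sorted(averages.items()):
    print(agent_id, avg)
```

So agent 1 averages three pairwise distances (1-2, 1-3, 2-3) and agent 2 has only the single distance between clients 4 and 5.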
How do I go about doing this using PySpark?