I have a three-level hierarchy: property -> prov -> co. Each property has a segment, e.g. Home / Core. I have written a query to get the count of each below:
properties = spark.sql("""
SELECT
    COUNT(ps.property_id) AS property_count,
    ps.prov_id,
    c.id AS co_id,
    ps.segment
FROM
    schema.t1 ps
INNER JOIN
    schema.t2 c
ON
    c.id = ps.co_id
GROUP BY
    2, 3, 4
""")
properties = properties.toPandas()
This gives me the total number of properties per segment, per prov, per co. From the above properties df, I want to create a new df with these columns:
- prov_id
- prov_segment
- co_id
- co_segment
The prov_segment should be 'Home' if >50% of the properties in that prov_id fall into the Home segment; otherwise it should be 'Core'.
Likewise, co_segment should be 'Home' if >50% of that co_id's prov_ids fall into the Home prov_segment; otherwise it should be 'Core'.
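In other words, it's a simple majority vote at each level. As a hypothetical helper, just to pin down the rule I mean (not code I actually have):

```python
def majority_segment(home_count: int, total_count: int) -> str:
    # 'Home' only when strictly more than half of the units are Home
    return 'Home' if home_count / total_count > 0.5 else 'Core'
```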
I know I can get the total number of properties per prov by grouping the data:
prop_total_count = properties.groupby('prov_id')['property_count'].sum()
However, I'm unsure how to use this to create the new dataframe.
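My rough idea is a pandas sketch along these lines (untested; it assumes the segment values pivot cleanly into 'Core' and 'Home' columns, and it applies the prov-majority rule above at the co level):

```python
import numpy as np

# One row per (prov_id, co_id), with property counts split into segment columns
prov = (properties
        .pivot_table(index=['prov_id', 'co_id'], columns='segment',
                     values='property_count', aggfunc='sum', fill_value=0)
        .reset_index())

# prov_segment: 'Home' when Home properties are >50% of that prov's total
prov['prov_segment'] = np.where(
    prov['Home'] / (prov['Home'] + prov['Core']) > 0.5, 'Home', 'Core')

# co_segment: 'Home' when >50% of the co's provs have prov_segment 'Home'
co_segment = prov.groupby('co_id')['prov_segment'].apply(
    lambda s: 'Home' if (s == 'Home').mean() > 0.5 else 'Core')
prov['co_segment'] = prov['co_id'].map(co_segment)

result = prov[['prov_id', 'prov_segment', 'co_id', 'co_segment']]
```

Is this a sensible approach, or is there a cleaner way to do it?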
Example data (properties.show(6)):
| property_count | prov_id | co_id | segment |
|----------------|---------|-------|---------|
| 10 | 1 | ABC | Core |
| 200 | 1 | ABC | Home |
| 300 | 9 | ABC | Core |
| 10 | 9 | ABC | Home |
| 100 | 131 | MNM | Home |
| 200 | 199 | KJK | Home |
Based on the above, I would want the below output:
| prov_id | prov_segment | co_id | co_segment |
|---------|--------------|-------|------------|
| 1 | Home | ABC | Core |
| 9 | Core | ABC | Core |
| 131 | Home | MNM | Home |
| 199 | Home | KJK | Home |
prov_id 1 gets the Home segment as it has 200 Home properties compared to 10 Core properties. prov_id 9 gets the Core segment as it has 300 Core properties to 10 Home properties.
co_id ABC gets the Core segment: only one of its two provs is Home, which is not more than 50% (and at the property level the portfolio has a total of 310 Core properties versus 210 Home).
prov_ids 131 and 199 each appear in only a single segment, so that segment carries through.