
I have a three-way hierarchy: property -> prov -> co. Each property also has a segment, e.g. Home / Core. I have written the query below to get the property count per segment, per prov, per co:

properties = spark.sql("""
    SELECT
        COUNT(ps.property_id) as property_count,
        ps.prov_id,
        c.id as co_id,
        ps.segment
    FROM
        schema.t1 ps
    INNER JOIN
        schema.t2 c
        ON c.id = ps.co_id
    GROUP BY
        2,3,4
""")
properties = properties.toPandas()
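
For reference, the same aggregation could also be expressed with the DataFrame API. This is only a sketch assuming the same schema.t1 / schema.t2 tables and column names as in the SQL above:

from pyspark.sql import functions as F

# same join + group-by + count as the SQL query above
ps = spark.table("schema.t1")
c = spark.table("schema.t2")

properties_df = (
    ps.join(c, ps.co_id == c.id)
      .groupBy(ps.prov_id, c.id.alias("co_id"), ps.segment)
      .agg(F.count(ps.property_id).alias("property_count"))
)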

This gives me the total number of properties per segment, per prov, per co. From the above df properties, I want to create a new df which looks like this:

- prov_id,
- prov_segment,
- co_id,
- co_segment

The prov_segment should be 'Home' if more than 50% of the properties in that prov_id fall into the Home segment, otherwise it should be 'Core'. Likewise, co_segment should be 'Home' if more than 50% of the prov_ids fall into the Home prov_segment, otherwise it should be 'Core'.

I know, I can get the total number of properties by grouping the data:

prop_total_count = properties.groupby('prov_id')['property_count'].sum()

However, I'm unsure how to use this to create the new dataframe.
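
For example, the per-segment sums can be pulled out the same way (a quick sketch using the same columns as the query output):

# Home-only counts per prov_id, and the Home share of each prov's total
home_count = properties[properties.segment == 'Home'].groupby('prov_id')['property_count'].sum()
home_share = (home_count / prop_total_count).fillna(0)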

Example data:

properties.head(6):

| property_count | prov_id | co_id | segment |
|----------------|---------|-------|---------|
| 10             | 1       | ABC   | Core    |
| 200            | 1       | ABC   | Home    |
| 300            | 9       | ABC   | Core    |
| 10             | 9       | ABC   | Home    |
| 100            | 131     | MNM   | Home    |
| 200            | 199     | KJK   | Home    |

Based on the above, I would want the below output:

| prov_id | prov_segment | co_id | co_segment |
|---------|--------------|-------|------------|
| 1       | Home         | ABC   | Core       |
| 9       | Core         | ABC   | Core       |
| 131     | Home         | MNM   | Home       |
| 199     | Home         | KJK   | Home       |

prov_id 1 gets a Home segment as it has 200 Home properties compared to 10 Core properties. prov_id 9 gets a Core segment as it has 300 Core properties compared to 10 Home properties.

co_id ABC gets a Core segment due to the portfolio having a total 310 Core properties compared to 210 Home properties.

prov_ids 131 and 199 only have properties in a single segment, so that segment carries over.

Someguywhocodes

1 Answer


OK, it may be possible to tackle this problem in a shorter way, but this should work. It relies on creating two other DataFrames with the segments per group (co_id or prov_id) and then merging those DataFrames at the end.

Merging a Series like co_id['co_segment'] into a DataFrame is not possible with older pandas versions, so I added the .to_frame() call for compatibility purposes. With pandas version >= 0.25.1 this operation is allowed and that call is superfluous.
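
As a quick standalone illustration of that point (not part of the solution below):

import pandas as pd

left = pd.DataFrame({'prov_id': [1, 9], 'co_id': ['ABC', 'ABC']})
right = pd.Series(['Home', 'Core'],
                  index=pd.Index([1, 9], name='prov_id'),
                  name='prov_segment')

# Older pandas raises "can not merge DataFrame with instance of type Series"
# if the Series is passed directly; converting it first works everywhere.
merged = pd.merge(left, right.to_frame(), on='prov_id')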

NB: This code assumes the only segments are Home, Core and Managed.

import pandas as pd

properties = pd.DataFrame(data={'property_count': [10, 200, 300, 10, 100, 200], 
                                'prov_id': [1, 1, 9, 9, 131, 199], 
                                'co_id': ['ABC', 'ABC', 'ABC', 'ABC', 'MNM', 'KJK'], 
                                'segment': ['Core', 'Home', 'Core', 'Home', 'Home', 'Home']})


def get_segment(row):
    if row['home_perc'] > 0.5:
        return 'Home'
    elif row['core_perc'] > 0.5:
        return 'Core'
    else:
        return 'Managed'


def get_grouped_dataframe(properties_df, grouping_col):
    # Sum the property counts per group: overall and per segment.
    grouped = pd.DataFrame()
    grouped['total'] = properties_df.groupby(grouping_col)['property_count'].sum()
    grouped['home'] = properties_df[properties_df.segment == 'Home'].groupby(grouping_col)['property_count'].sum()
    grouped['core'] = properties_df[properties_df.segment == 'Core'].groupby(grouping_col)['property_count'].sum()
    grouped['managed'] = properties_df[properties_df.segment == 'Managed'].groupby(grouping_col)['property_count'].sum()
    # Shares of the group total; groups with no properties in a segment get NaN, so fill with 0.
    grouped['home_perc'] = (grouped['home'] / grouped['total']).fillna(0)
    grouped['core_perc'] = (grouped['core'] / grouped['total']).fillna(0)
    grouped['managed_perc'] = (grouped['managed'] / grouped['total']).fillna(0)
    grouped['segment'] = grouped.apply(get_segment, axis=1)

    return grouped


prov_id = get_grouped_dataframe(properties, 'prov_id')
prov_id.rename(columns={'segment': 'prov_segment'}, inplace=True)

#          total  home   core  home_perc  core_perc prov_segment
# prov_id                                                  
# 1          210   200   10.0   0.952381   0.047619         Home
# 9          310    10  300.0   0.032258   0.967742         Core
# 131        100   100    NaN   1.000000   0.000000         Home
# 199        200   200    NaN   1.000000   0.000000         Home

co_id = get_grouped_dataframe(properties, 'co_id')
co_id.rename(columns={'segment': 'co_segment'}, inplace=True)

#        total  home   core  home_perc  core_perc co_segment
# co_id                                                  
# ABC      520   210  310.0   0.403846   0.596154       Core
# KJK      200   200    NaN   1.000000   0.000000       Home
# MNM      100   100    NaN   1.000000   0.000000       Home

property_segments = properties.drop(columns=['property_count', 'segment']).drop_duplicates()

property_segments = pd.merge(property_segments, prov_id['prov_segment'].to_frame(), on='prov_id')
property_segments = pd.merge(property_segments, co_id['co_segment'].to_frame(), on='co_id')

#    prov_id co_id co_segment prov_segment
# 0        1   ABC       Core         Home
# 1        9   ABC       Core         Core
# 2      131   MNM       Home         Home
# 3      199   KJK       Home         Home
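
As mentioned above, a shorter route probably exists. Here is one possibly more compact sketch of my own, using pivot_table and the same >50% rule; I have only checked it against the example data:

import numpy as np

def majority_segment(df, grouping_col):
    # One column per segment with summed property counts; missing segments become 0
    counts = df.pivot_table(index=grouping_col, columns='segment',
                            values='property_count', aggfunc='sum', fill_value=0)
    share = counts.div(counts.sum(axis=1), axis=0)
    # Same rule as get_segment: Home if >50%, else Core if >50%, else Managed
    labels = np.where(share.get('Home', 0) > 0.5, 'Home',
                      np.where(share.get('Core', 0) > 0.5, 'Core', 'Managed'))
    return pd.Series(labels, index=share.index)

result = (properties[['prov_id', 'co_id']].drop_duplicates()
          .merge(majority_segment(properties, 'prov_id').rename('prov_segment').to_frame(), on='prov_id')
          .merge(majority_segment(properties, 'co_id').rename('co_segment').to_frame(), on='co_id'))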

EDIT: Moved the repeated code into a function, added the Managed segment as per the comment, and added the extra .to_frame() calls for compatibility purposes.

UJIN
  • There is actually one additional segment named `Managed`. – Someguywhocodes Mar 25 '20 at 09:25
  • When running I get an error: `can not merge DataFrame with instance of type Series`. Does the solution need to be modified? – Someguywhocodes Mar 25 '20 at 12:54
  • A quick search of that error on google leads to [this question](https://stackoverflow.com/questions/37968785/merging-two-dataframes) on StackOverflow. I updated the code following the answer I found there. Now it should work even for your older `pandas` version! – UJIN Mar 25 '20 at 13:17
  • Perhaps a new question is better than a comment(?) but I was wondering if it's possible to modify the solution to be per day? I.e. if I introduce `yyyy_mm_dd` column to example `properties` df. I've been trying to modify your solution but no luck so far. – Someguywhocodes Mar 26 '20 at 10:39