4

I have a scenario where I need to transform the values of a particular column based on the value in another column of the same row and on a value in another dataframe.

Example-

print(parent_df)
       school         location      modifed_date
0      school_1       New Delhi     2020-04-06
1      school_2       Kolkata       2020-04-06
2      school_3       Bengaluru     2020-04-06
3      school_4       Mumbai        2020-04-06
4      school_5       Chennai       2020-04-06

print(location_df)
       school          location     
0      school_10       New Delhi
1      school_20       Kolkata     
2      school_30       Bengaluru
3      school_40       Mumbai       
4      school_50       Chennai

For this use case, I need to transform the school names in parent_df based on the location column in the same df and the location column in location_df.

To achieve this transformation, I wrote the following method.

def transform_school_name(row, location_df):
    name_alias = location_df[location_df['location'] == row['location']]
    if len(name_alias) > 0:
        return location_df.school.iloc[0]
    else:
        return row['school']

And this is how I am calling this method

parent_df['school'] = parent_df.apply(UtilityMethods.transform_school_name, args=(self.location_df,), axis=1)

The issue is that for just 46K records, the entire transformation takes around 2 minutes, which is too slow. How can I improve the performance of this solution?

EDITED

Following is the actual scenario I am dealing with, in which a small transformation needs to be done before we can replace the value in the original column. I am not sure whether this can be done within the replace() method mentioned in one of the answers below.

print(parent_df)
       school         location                  modifed_date    type
0      school_1       _pre_New Delhi_post       2020-04-06      Govt
1      school_2       _pre_Kolkata_post         2020-04-06      Private
2      school_3       _pre_Bengaluru_post       2020-04-06      Private
3      school_4       _pre_Mumbai_post          2020-04-06      Govt
4      school_5       _pre_Chennai_post         2020-04-06      Private

print(location_df)
       school          location     type
0      school_10       New Delhi    Govt
1      school_20       Kolkata      Private
2      school_30       Bengaluru    Private

Custom Method code

def transform_school_name(row, location_df):
    location_values = row['location'].split('_')
    name_alias = location_df[location_df['location'] == location_values[1]]
    name_alias = name_alias[name_alias['type'] == location_df['type']]
    if len(name_alias) > 0:
        return location_df.school.iloc[0]
    else:
        return row['school']


This is the actual scenario I need to handle, so using the replace() method alone won't help.

Mitaksh Gupta

4 Answers

3

You can use map/replace:

parent_df['school'] = parent_df.location.replace(location_df.set_index('location')['school'])

Output:

      school   location modifed_date
0  school_10  New Delhi   2020-04-06
1  school_20    Kolkata   2020-04-06
2  school_30  Bengaluru   2020-04-06
3  school_40     Mumbai   2020-04-06
4  school_50    Chennai   2020-04-06
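To see the whole round trip, here is a minimal, self-contained sketch of the map variant, using the sample data from the original question (the fillna fallback is an addition, keeping the old name for any location missing from location_df):

```python
import pandas as pd

parent_df = pd.DataFrame({
    'school': ['school_1', 'school_2', 'school_3', 'school_4', 'school_5'],
    'location': ['New Delhi', 'Kolkata', 'Bengaluru', 'Mumbai', 'Chennai'],
})
location_df = pd.DataFrame({
    'school': ['school_10', 'school_20', 'school_30', 'school_40', 'school_50'],
    'location': ['New Delhi', 'Kolkata', 'Bengaluru', 'Mumbai', 'Chennai'],
})

# build a location -> school lookup, then map it over parent_df
lookup = location_df.set_index('location')['school']
parent_df['school'] = parent_df['location'].map(lookup).fillna(parent_df['school'])
```

With replace, unmatched locations are simply left as-is, so the fillna step is only needed in the map variant.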
Quang Hoang
2

IIUC, this is more of a regex issue, as the pattern doesn't match exactly. First extract the required pattern, create a mapping from the extracted locations in parent_df to location_df, then map the values.

pat = '.*?(' + '|'.join(location_df['location']) + ').*?'

mapping = parent_df['location'].str.extract(pat)[0].map(location_df.set_index('location')['school'])

parent_df['school'] = mapping.combine_first(parent_df['school'])
parent_df


    school      location            modifed_date    type
0   school_10   _pre_New Delhi_post 2020-04-06      Govt
1   school_20   _pre_Kolkata_post   2020-04-06      Private
2   school_30   _pre_Bengaluru_post 2020-04-06      Private
3   school_4    _pre_Mumbai_post    2020-04-06      Govt
4   school_5    _pre_Chennai_post   2020-04-06      Private
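A self-contained sketch of the steps above, using the sample data from the edited question (note that this approach matches on location only, not on type):

```python
import pandas as pd

parent_df = pd.DataFrame({
    'school': ['school_1', 'school_2', 'school_3', 'school_4', 'school_5'],
    'location': ['_pre_New Delhi_post', '_pre_Kolkata_post',
                 '_pre_Bengaluru_post', '_pre_Mumbai_post', '_pre_Chennai_post'],
})
location_df = pd.DataFrame({
    'school': ['school_10', 'school_20', 'school_30'],
    'location': ['New Delhi', 'Kolkata', 'Bengaluru'],
})

# capture any known location embedded in the decorated string
pat = '.*?(' + '|'.join(location_df['location']) + ').*?'
extracted = parent_df['location'].str.extract(pat)[0]

# rows with no match extract NaN, so combine_first keeps the old school name
mapping = extracted.map(location_df.set_index('location')['school'])
parent_df['school'] = mapping.combine_first(parent_df['school'])
```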
Vaishali
2

As I understood the edited task, the following update is to be performed:

  • for each row in parent_df,
  • find a row in location_df with matching location (a part of location column and type),
  • if found, overwrite school column in parent_df with school from the row just found.

To do it, proceed as follows:

Step 1: Generate a MultiIndex to locate school names by city and school type:

ind = pd.MultiIndex.from_arrays([parent_df.location.str
    .split('_', expand=True)[2], parent_df.type])

For your sample data, the result is:

MultiIndex([('New Delhi',    'Govt'),
            (  'Kolkata', 'Private'),
            ('Bengaluru', 'Private'),
            (   'Mumbai',    'Govt'),
            (  'Chennai', 'Private')],
           names=[2, 'type'])

Don't worry about the strange first-level name (2); it will disappear soon.

Step 2: Generate a list of "new" locations:

locList = location_df.set_index(['location', 'type']).school[ind].tolist()

The result is:

['school_10', 'school_20', 'school_30', nan, nan]

For the first 3 schools a match was found; for the last 2, nothing.

Step 3: Perform the actual update with the above list, through "non-null" mask:

parent_df.school = parent_df.school.mask(pd.notnull(locList), locList)
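The three steps can be combined into a runnable sketch (sample data from the edited question). One caveat: on recent pandas versions, label selection with missing keys raises a KeyError, so this sketch uses reindex, which yields NaN for unmatched pairs, matching the list shown above:

```python
import pandas as pd

parent_df = pd.DataFrame({
    'school': ['school_1', 'school_2', 'school_3', 'school_4', 'school_5'],
    'location': ['_pre_New Delhi_post', '_pre_Kolkata_post',
                 '_pre_Bengaluru_post', '_pre_Mumbai_post', '_pre_Chennai_post'],
    'type': ['Govt', 'Private', 'Private', 'Govt', 'Private'],
})
location_df = pd.DataFrame({
    'school': ['school_10', 'school_20', 'school_30'],
    'location': ['New Delhi', 'Kolkata', 'Bengaluru'],
    'type': ['Govt', 'Private', 'Private'],
})

# Step 1: (city, type) pairs for every row of parent_df
ind = pd.MultiIndex.from_arrays(
    [parent_df.location.str.split('_', expand=True)[2], parent_df.type])

# Step 2: look up new school names; unmatched pairs become NaN
locs = location_df.set_index(['location', 'type']).school.reindex(ind)
locs.index = parent_df.index  # align positionally with parent_df

# Step 3: overwrite only where a match was found
parent_df.school = parent_df.school.mask(locs.notna(), locs)
```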

Execution speed

Due to the use of vectorized operations and index lookups, my code runs substantially faster than applying a function to each row.

Example: I replicated your parent_df 10,000 times and used %timeit to check the execution time of your code (actually, a slightly changed version, described below) and mine.

To allow repeated execution, I changed both versions so that they set a school_2 column and school remains unchanged.

Your code ran in 34.9 s, whereas mine took only 161 ms, roughly 216 times faster.

Yet quicker version

If parent_df has a default index (consecutive numbers starting from 0), then the whole operation can be performed with a single instruction:

parent_df.school = location_df.set_index(['location', 'type']).school[
    pd.MultiIndex.from_arrays(
        [parent_df.location.str.split('_', expand=True)[2],
         parent_df.type])
    ]\
    .reset_index(drop=True)\
    .combine_first(parent_df.school)

Steps:

  • location_df.set_index(...) - Set index to 2 "criterion" columns.
  • .school - Leave only school column (with the above index).
  • [...] - Retrieve from it elements indicated by the MultiIndex defined inside.
  • pd.MultiIndex.from_arrays( - Create the MultiIndex.
  • parent_df.location.str.split('_', expand=True)[2] - The first level of the MultiIndex - the "city" part from location.
  • parent_df.type - The second level of the MultiIndex - type.
  • reset_index(...) - Change the MultiIndex into a default index (now the index is the same as in parent_df).
  • combine_first(...) - Overwrite NaN values in the result generated so far with original values from school.
  • parent_df.school = - Save the result back in school column. For the test purpose, to check the execution speed, you can change it with parent_df['school_2'].

According to my assessment, the execution time is about 9% shorter than for my original solution.

Corrections to your code

  1. Take a look at location_values[1]. It retrieves the pre segment, whereas the next segment (the city name) should be retrieved.

  2. There is no need to create a temporary list based on the first condition and then narrow it down by filtering with the second condition. Both conditions (equality of location and type) can be applied in a single instruction, so the execution time is a bit shorter.

  3. The value returned in the "positive" case should be from name_alias, not location_df.

So if for some reason you want to stay with your code, change the respective fragment to:

name_alias = location_df[location_df['location'].eq(location_values[2]) &
    location_df['type'].eq(row.type)]
if len(name_alias) > 0:
    return name_alias.school.iloc[0]
else:
    return row['school']
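The corrected function can be checked end to end like this (sample data from the edited question; the row-wise apply remains the slow path):

```python
import pandas as pd

parent_df = pd.DataFrame({
    'school': ['school_1', 'school_2', 'school_3', 'school_4', 'school_5'],
    'location': ['_pre_New Delhi_post', '_pre_Kolkata_post',
                 '_pre_Bengaluru_post', '_pre_Mumbai_post', '_pre_Chennai_post'],
    'type': ['Govt', 'Private', 'Private', 'Govt', 'Private'],
})
location_df = pd.DataFrame({
    'school': ['school_10', 'school_20', 'school_30'],
    'location': ['New Delhi', 'Kolkata', 'Bengaluru'],
    'type': ['Govt', 'Private', 'Private'],
})

def transform_school_name(row, location_df):
    # splitting '_pre_New Delhi_post' on '_' puts the city at index 2
    location_values = row['location'].split('_')
    name_alias = location_df[location_df['location'].eq(location_values[2]) &
                             location_df['type'].eq(row['type'])]
    if len(name_alias) > 0:
        return name_alias.school.iloc[0]
    return row['school']

parent_df['school'] = parent_df.apply(transform_school_name,
                                      args=(location_df,), axis=1)
```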
Valdi_Bo
0

If I'm reading the question correctly, what you're implementing with the apply method is a kind of join operation. Pandas excels at vectorizing operations, and its C-based implementation of join (merge) is almost certainly faster than a Python/apply-based one. Therefore, I would try the following solution:

parent_df["location_short"] = parent_df.location.str.split("_", expand=True)[2]
parent_df = pd.merge(parent_df, location_df, how = "left", left_on=["location_short", "type"], 
                     right_on=["location", "type"], suffixes = ["", "_by_location"])

parent_df.loc[parent_df.school_by_location.notna(), "school"] = \
      parent_df.loc[parent_df.school_by_location.notna(), "school_by_location"]
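A runnable sketch of this merge approach, using the sample data from the edited question; the final drop of the helper columns is an addition, to restore the frame's original shape:

```python
import pandas as pd

parent_df = pd.DataFrame({
    'school': ['school_1', 'school_2', 'school_3', 'school_4', 'school_5'],
    'location': ['_pre_New Delhi_post', '_pre_Kolkata_post',
                 '_pre_Bengaluru_post', '_pre_Mumbai_post', '_pre_Chennai_post'],
    'modifed_date': ['2020-04-06'] * 5,
    'type': ['Govt', 'Private', 'Private', 'Govt', 'Private'],
})
location_df = pd.DataFrame({
    'school': ['school_10', 'school_20', 'school_30'],
    'location': ['New Delhi', 'Kolkata', 'Bengaluru'],
    'type': ['Govt', 'Private', 'Private'],
})

# extract the bare city name, then left-join on (city, type)
parent_df['location_short'] = parent_df.location.str.split('_', expand=True)[2]
merged = pd.merge(parent_df, location_df, how='left',
                  left_on=['location_short', 'type'],
                  right_on=['location', 'type'],
                  suffixes=['', '_by_location'])

# overwrite school only where the join found a match
hit = merged.school_by_location.notna()
merged.loc[hit, 'school'] = merged.loc[hit, 'school_by_location']

# drop the helper columns introduced above
result = merged.drop(columns=['location_short', 'school_by_location',
                              'location_by_location'])
```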

As far as I could understand, it produces what you're looking for.


Roy2012