
How can I replace values in one dataframe based on another lookup dataframe?

This is dataframe A, where I want to replace the values:

  InfoType IncidentType DangerType
0      NaN            A        NaN
1      NaN            C        NaN
2      NaN            B          C
3      NaN            B        NaN

This is the lookup table:

   ID     ParamCode ParamValue ParmDesc1 ParamDesc2  SortOrder  ParamStatus
0   1  IncidentType          A       ABC        DEF          1            1
1   2  IncidentType          B       GHI        JKL          2            1
2   3  IncidentType          C       MNO        PQR          7            1
3   4    DangerType          C       STU        VWX          6            1

The expected output:

  InfoType IncidentType DangerType
0      NaN          ABC        NaN
1      NaN          MNO        NaN
2      NaN          GHI        STU
3      NaN          GHI        NaN

Note that ParamCode holds the column names of dataframe A, and I need to map ParmDesc1 into the respective columns of dataframe A. Every column in dataframe A may contain NaN, and I don't intend to remove them; just ignore them.

This is what I have done:

ntf_cols = ['InfoType','IncidentType','DangerType']
for c in ntf_cols:
    if (c in ntf.columns) & (c in param['ParamCode'].values):
        paramValue = param['ParamValue'].unique()
        for idx, pv in enumerate(paramValue):
            ntf['NewIncidentType'] = pd.np.where(ntf.IncidentType.str.contains(pv), param['ParmDesc1'].values, "whatever")

Error :

ValueError: operands could not be broadcast together with shapes (25,) (13,) ()

dee
  • This is a common duplicate. See, e.g., https://stackoverflow.com/questions/36413993/replace-column-values-in-one-dataframe-by-values-of-another-dataframe – Eric Truett Apr 21 '20 at 02:47
  • @Eric Truett, before this gets voted closed as a duplicate, is there a better matching question? OP appears to need to reference the values of two columns in the other df, not one. @dee, it would improve your question some if you provided what you have tried thus far to solve this problem. – Phillyclause89 Apr 21 '20 at 02:53
  • How is this a duplicate? @EricTruett My question needs to compare column names with values in another dataframe. It is quite different. It is not between two columns. – dee Apr 21 '20 at 02:58
  • @Phillyclause89 I tried multiple times but I keep deleting them. I can't wrap my head around it, since the comparison is between column names and values in another dataframe. I am not sure how to map them in this case. – dee Apr 21 '20 at 03:00
  • @Phillyclause89 At first I was planning to convert the lookup table to a dictionary. But then I don't see the benefit here. – dee Apr 21 '20 at 03:01
  • 1
    @Phillyclause89 updated with code I've been trying to do. – dee Apr 21 '20 at 03:26

2 Answers


EDIT: Lambda's answer gave me an idea for how you could do this for many columns that you want to apply this logical pattern to:

import pandas as pd

df1 = pd.DataFrame(dict(
    InfoType = [None, None, None, None],
    IncidentType = 'A C B B'.split(),
    DangerType = [None, None, 'C', None],
))

df2 = pd.DataFrame(dict(
    ParamCode = 'IncidentType IncidentType IncidentType DangerType'.split(),
    ParamValue  = 'A B C C'.split(),
    ParmDesc1 = 'ABC GHI MNO STU'.split(),
))


for col in df1.columns[1:]:
    # build a {ParamValue: ParmDesc1} mapping from the df2 rows whose ParamCode matches this column
    dict_map = dict(
        df2[df2.ParamCode == col][['ParamValue','ParmDesc1']].to_records(index=False)
    )
    # replace the codes with their descriptions; values not in the mapping (e.g. NaN) are left alone
    df1[col] = df1[col].replace(dict_map)

print(df1)

This assumes every column after the first column in df1 is one that needs updating, and that the to-be-updated column names exist as values in the 'ParamCode' column of df2.

Python tutor link to code
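For reference, running the loop above should print roughly the following (how the missing values render as None or NaN depends on your pandas version and the column dtypes), which matches the expected output in the question:

  InfoType IncidentType DangerType
0     None          ABC       None
1     None          MNO       None
2     None          GHI        STU
3     None          GHI       None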


This problem could be solved using some custom functions and pandas.Series.apply():

import pandas as pd

def find_incident_type(x):
    # leave missing values untouched
    if pd.isna(x):
        return x
    # look up the ParmDesc1 for this IncidentType code in df2
    return df2[
        (df2['ParamCode'] == 'IncidentType') & (df2['ParamValue']==x)
    ]["ParmDesc1"].values[0]


def find_danger_type(x):
    # leave missing values untouched
    if pd.isna(x):
        return x
    # look up the ParmDesc1 for this DangerType code in df2
    return df2[
        (df2['ParamCode'] == 'DangerType') & (df2['ParamValue']==x)
    ]["ParmDesc1"].values[0]


df1 = pd.DataFrame(dict(
    InfoType = [None, None, None, None],
    IncidentType = 'A C B B'.split(),
    DangerType = [None, None, 'C', None],
))

df2 = pd.DataFrame(dict(
    ParamCode = 'IncidentType IncidentType IncidentType DangerType'.split(),
    ParamValue  = 'A B C C'.split(),
    ParmDesc1 = 'ABC GHI MNO STU'.split(),
))

df1['IncidentType'] = df1['IncidentType'].apply(find_incident_type)
df1['DangerType'] = df1['DangerType'].apply(find_danger_type)

print(df1)

step through the code in python tutor

It is very possible there is a more efficient way to do this. Hopefully someone who knows it will share it.

Also, the reference to df2 from the outer scope is hard-coded into the custom functions, so they will only work with that variable name in the outer scope. You'll need to pass the lookup frame through pandas.Series.apply's args parameter if you don't want these functions to depend on that reference.
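A minimal sketch of that variant, assuming the same df1 and df2 as above; the find_desc name and the lookup/code parameters are illustrative, not part of the original answer:

def find_desc(x, lookup, code):
    # leave missing values untouched
    if pd.isna(x):
        return x
    # look up ParmDesc1 for this ParamCode/ParamValue pair in the lookup frame
    return lookup[
        (lookup['ParamCode'] == code) & (lookup['ParamValue'] == x)
    ]['ParmDesc1'].values[0]

df1['IncidentType'] = df1['IncidentType'].apply(find_desc, args=(df2, 'IncidentType'))
df1['DangerType'] = df1['DangerType'].apply(find_desc, args=(df2, 'DangerType'))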

Phillyclause89
  • 1
    Thank you very much. I need to look at this soon! But I try to avoid `apply` because of performance issue. I'm still trying to write up my own and do comparison with yours as yoy stated above, I can't hard code the columns, as I have other columns to transform as well. – dee Apr 21 '20 at 04:18
  • @dee My idea here was to make a custom function for each column in `df1` that you need to replace values for, based on specific values from `df2` (your lookup table). You then call the `pandas.Series.apply` method with the respective function for each column that needs values replaced. – Phillyclause89 Apr 21 '20 at 04:27
  • Hey, I also tried your second solution and it worked just as expected! Thank you! – dee Apr 22 '20 at 00:55

Use the lookup table to make a dict, and then replace the column values of the original dataframe. Assume the original dataframe is df1 and the lookup table is df2:

...
dict_map = dict(zip(df2.ParamCode + "-" + df2.ParamValue, df2.ParmDesc1))

df1['IncidentType'] = ("IncidentType" +'-'+ df1.IncidentType).replace(dict_map)
df1['DangerType'] = ("DangerType" +'-'+ df1.DangerType).replace(dict_map)
...
Lambda
  • This seems to work. I have a question: how can I replace `df1.IncidentType` with a variable rather than hard-coding it? I need to apply this to every column. I can show you my code. – dee Apr 21 '20 at 06:08
  • Use a for loop over the columns you want to apply it to, something like `for col in df1.columns: df1[col] = (col+"-"+df1[col]).replace(dict_map)` – Lambda Apr 21 '20 at 06:13
  • If some columns' dtype is float64, you should change them to str: `for col in df1.columns: df1[col] = (col+"-"+df1[col].astype(str)).replace(dict_map)` – Lambda Apr 21 '20 at 06:18
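Putting those two comments together, a loop over the target columns might look like the sketch below. This is not from the original answer: the column list is illustrative, and it uses Series.map plus fillna instead of replace so that NaN rows and values with no match keep their original contents rather than the prefixed string:

dict_map = dict(zip(df2.ParamCode + "-" + df2.ParamValue, df2.ParmDesc1))

for col in ['IncidentType', 'DangerType']:           # illustrative list of columns to transform
    keyed = col + "-" + df1[col].astype(str)         # e.g. "IncidentType-A"; NaN becomes "IncidentType-nan"
    df1[col] = keyed.map(dict_map).fillna(df1[col])  # unmatched keys fall back to the original value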