1

I have two DataFrames with many thousands of rows. The columns all have the dtype string. A snippet looks like this:

DF1:

                 ID   SUCCESSOR 
    0       0001234     3620031
    1       0001235     6640002
    2       0002456     8620003
    3       0013456     8640004
    4       1711999     1283456 <- see DF2
    ...         ...         ... 
    409813  9162467        <NA>
    409814  9212466        <NA>
    409815  9312466     6975A0C
    409816  9452463        <NA>
    409817  9591227        <NA>

DF2:

                 ID
    2       1111682
    3       1123704
    14      1567828
    15      1711999 <- that value should be replaced with '1283456'
    16      1711834
    ...         ...
    845775  970879B
    845776  975879B
    845777  9275A0A
    845778  9285A05
    845779  9295A05

Don't be surprised that the second DataFrame is missing a few indexes, because I filtered them before, as they were not relevant. Also, the NaNs are not relevant because my algorithm bypasses them.

I now want to replace the IDs in the second DataFrame with the successors in the first DataFrame that have the same ID.

The output should be:

                 ID
    2       1111682
    3       1123704
    14      1567828
    15      1283456 <- now replaced
    16      1711834
    ...         ...
    845775  970879B
    845776  975879B
    845777  9275A0A
    845778  9285A05
    845779  9295A05

In order not to blow up the example, I have replaced only one value. In reality there are several replacements

Two approaches:

  1. In my first approach I iterated through DF1 and used the replace() function, however this approach takes years, so its useless.

  2. In my second approach, I first convert DF1 to a dictionary and then apply the map() function. I do this as described here by JohnE: Remap values in pandas column with a dict and in a small example it works wonderfully:

     df = pd.DataFrame({'col1': {1: 1, 2: 2, 3: 4, 4: 1}, 'col2': {1: 2, 2: np.nan}})
     di = {1: "A", 2: "B"}
    
     col1  col2
     1     1   2.0
     2     2   NaN
     3     4   NaN
     4     1   NaN
    
     df['col1'].map(di).fillna(df['col1'])
    
     1    A
     2    B
     3    4
     4    A
    

My function to map DF1 and DF2 looks like this:

    def mapping(df1, df2):   

        di =dict(zip(df1.ID, df1.SUCCESSOR)) # create the dict
        changes = 1   

        while(changes > 0):        
    
            changes = 0
            df_old = df2                    
            print(df2) #check how df2 looks before mapping.
            df2['ID'] = df2['ID'].map(di).fillna(df2['ID'])                   
            print(df2) # check how df2 looks after mapping. Unfortunately no changes :( so the error must be in the mapping function one line above here.
            if df_old.equals(df2) == False:
                changes = 1    
    
        return df2

So obviously the error must be in this line:

    df2['ID'] = df2['ID'].map(dic).fillna(df2['ID']).

However, I just can't figure out why this doesn't work. What is not working here and why?

If anyone can help me, I am indebted to them forever!!!

Best regards, Alfonso

EDIT: EDIT: I found the error and I am an idiot. My solution worked, but the line: "df_old = df2" prevented the loop from continuing. MANY THANKS anyway and sorry if I have robbed time!

Alfonso
  • 11
  • 4

1 Answers1

1

Here is a oneliner that creates the replacement dict by filtering the dataframes:

df2['ID'] = df2['ID'].replace(dict(zip(df2[df2['ID'].isin(df1['ID'])].sort_values(by=['ID']).reset_index()['ID'], df1.loc[df1['ID'].isin(df2['ID'])].sort_values(by=['ID']).reset_index()['SUCCESSOR'])))
RJ Adriaansen
  • 9,131
  • 2
  • 12
  • 26
  • Thank you, but even this solution just does not change anything in the DF2. I do not know what the problem is. Could it be that the function is not applicable to a subset? DF2 is a subset of the original. It originally had ~ 845k IDs, but only ~490k of them are left after filtering, hence the gaps in the indices.. – Alfonso Aug 12 '21 at 21:47
  • Also, I have to add that IDs in DF2 sometimes have up to 30 successor IDs. DF1 explains a supersession of IDs, i.e. if IDs were exchanged in DF2, it is likely that they have to be exchanged again in the next iteration. Until all IDs have reached their "last successor". That's why I used a while loop in my first post. Iteration continues until no more IDs need to be exchanged, since there are no more successors in DF1. – Alfonso Aug 12 '21 at 21:51