Python / Pandas Join records when they match in at least one column

Question

I have the following dataframe:

id          phone       email
10352897        
10352897    10225967    
10352897                user@gmail.com
10352897    10225967    user@gmail.com
            10225967    
            10225967    user@gmail.com
                        user@gmail.com
23578910        
23578910    38256789    
23578910                user2@gmail.com
23578910    38256789    user2@gmail.com
            38256789    
            38256789    user2@gmail.com
                        user2@gmail.com
            65287930    user3@gmail.com
            65287930
                        user3@gmail.com
            65287930
            70203065
            70203065
            70203065
                        user4@gmail.com
                        user4@gmail.com
                        user4@gmail.com

Not all the fields are always filled in, but they are related to each other in at least one column.

I would like that when it coincides in at least one of the three columns, the record joins and prioritizes the filled fields over the empty ones, in the end in this example I would expect the following output:

id          phone       email
10352897    10225967    user@gmail.com
23578910    38256789    user2@gmail.com
            65287930    user3@gmail.com
            70203065
                        user4@gmail.com

How would you go about doing this?

the one you marked in green, doesn't follow your pattern *coincides in at least one of the three columns,* . it should split into three separate groups — Naveed, Oct 31 '22 at 23:33

score 0 · Answer 1 · answered Oct 31 '22 at 22:14

This is quite a specific requirement and I'm not aware of any in-built pandas function that does what you're looking for, so I tried to define exactly what you are looking to do and recreate it from scratch.

The best I could come up with is that reading from the top of the dataframe, we will look at the values in each of the columns until we find a row with a value in a column that is different from the previously encountered value in that column, at which point we put all the values encountered so far into a row in the new dataframe.

This looks like (assuming your original dataframe is named df and the empty cells are blanks (''):

new_rows = []
# Create a dictionary where keys are the columns of the original df
new_row = {col: '' for col in df.columns}
# Iterate over rows
for _, row in df.iterrows():
    # Iterate over columns in row
    for col in row.keys():
        # If this column is not blank
        if row[col]:
            # If this column has already been filled in the new row
            # and the value is different, add this row to the new dataframe
            if new_row[col] and new_row[col] != row[col]:
                new_rows.append(new_row)
                new_row = {col: '' for col in df.columns}
            # Otherwise, set this value for the current row of the new dataframe
            else:
                new_row[col] = row[col]
# Add the last row
new_rows.append(new_row)
new_df = pd.DataFrame(new_rows)
print(new_df)

However, 70203065 and user4@gmail.com end up in the same row in the new dataframe:

        id     phone            email
0  10352897  10225967   user@gmail.com
1  23578910  38256789  user2@gmail.com
2            65287930  user3@gmail.com
3            70203065  user4@gmail.com

You may need to think about what the logic is that causes 70203065 and user4@gmail.com to end up in separate rows, but hopefully this gets you started in the right direction.

Gonçalo Peres · Answer 2 · 2022-10-31T22:40:28.063

If the goal is solely to obtain the desired output, one approach would be to get a dataframe with unique values of each column in the original dataframe df, and for that, one can use pandas.DataFrame.apply with a custom lambda function as follows

df_new = df.apply(lambda x: pd.Series(x.unique()[~pd.isnull(x.unique())]))

[Out]:

           id       phone            email
0  10352897.0  10225967.0   user@gmail.com
1  23578910.0  38256789.0  user2@gmail.com
2         NaN  65287930.0  user3@gmail.com
3         NaN  70203065.0  user4@gmail.com

Then, even though there is room to implement a validation to check if, for a given unique value, the rest of the columns match in the df, as in this specific case the first three rows are actually correct, we won't consider that. One will simply duplicate the last row

df_new = pd.concat([df_new, df_new.iloc[-1:]], ignore_index=True)

Then adjust the respective values to obtain the desired output

df_new.iloc[-2,2] = np.nan
df_new.iloc[-1,1] = np.nan

[Out]:

           id       phone            email
0  10352897.0  10225967.0   user@gmail.com
1  23578910.0  38256789.0  user2@gmail.com
2         NaN  65287930.0  user3@gmail.com
3         NaN  70203065.0              NaN
4         NaN         NaN  user4@gmail.com

Notes:

Even though the final part is not the most elegant, and require some "manual" work (duplicating a row and manually changing the cell values), this works for OP's specific case.
There are strong opinions on using .apply(). For that, one might want to read this.

score 0 · Answer 3 · answered Oct 31 '22 at 23:53

here is one way to do it.

please note: the third section in your question, don't follow your stated pattern coincides in at least one of the three columns,

# Compare each rows with the next and identify ones where all three columns have changed
df['keep']=((df['id'].ne(df['id'].shift(-1)) &
             df['phone'].ne(df['phone'].shift(-1)) &
             df['email'].ne(df['email'].shift(-1))
            ))

# map
df['chng']=df['keep'].map({False : np.nan, True: 1})

# cumsum to find the number of groups
df['chng']=df['chng'].cumsum().bfill()

# groupby and fill values
df=df.groupby('chng', as_index=True).ffill()

# keep the ones where values have changed 
out=df.loc[df['keep']==True].fillna('')[['id','phone','email']]
out

            id  phone       email
6   10352897.0  10225967.0  user@gmail.com
13  23578910.0  38256789.0  user2@gmail.com
15              65287930.0  user3@gmail.com
16                          user3@gmail.com
17              65287930.0  
20              70203065.0  
23                          user4@gmail.com

though, if you run the above code again, it further consolidates the middle three rows into your expected resultset

            id       phone           email
6   10352897.0  10225967.0  user@gmail.com
13  23578910.0  38256789.0  user2@gmail.com
17              65287930.0  user3@gmail.com
20              70203065.0  
23                          user4@gmail.com

Python / Pandas Join records when they match in at least one column

3 Answers3