0

I have a dataframe df (see image below) which I need to merge with N dataframes.
In this post, for the sake of clarity, N=3.

The goal is to check if every value of the column Id exists in the three other dataframes and have the same value as well. And if so, the row has to be highlighted with a green color. That's it !

enter image description here

Code :

import pandas as pd
import numpy as np

### --- Dataframes

df = pd.DataFrame({'Id' : ['AA', 'BB', 'CC', 'DD', 'EE'],
                   'Value': ['three', 'two', 'five','four', 'one']})
df1 = pd.DataFrame({'Id1' : [np.nan, 'CC', 'BB', 'DD', np.nan],
                    'Value1' : ['one', 'four', 'two', np.nan, np.nan]})
df2 = pd.DataFrame({'Id2' : ['AA', 'BB', 'CC', 'DD', 'JJ'],
                   'Value2' : [np.nan, 'two', 'five', np.nan, 'six']})
df3 = pd.DataFrame({'Id3' : ['FF', 'HH', 'CC', 'GG', 'BB'],
                   'Value3' : ['seven', 'five', 'one','three', 'two']})

### --- Joining df to df1, df2 and df3

df_df1 = df.merge(df1, left_on='Id', right_on='Id1', how='left')
df_df1_df2 = df_df1.merge(df2, left_on='Id', right_on='Id2', how='left')
df_df1_df2_df3 = df_df1_df2.merge(df3, left_on='Id', right_on='Id3', how='left')

### --- Creating a function to highlight the aligned rows

def highlight_aligned_row(x):
    
    m1 = (x['Id'] == x['Id1']) &  (x['Id'] == x['Id2']) & (x['Id'] == x['Id3'])
    m2 = (x['Value'] == x['Value1']) &  (x['Value']== x['Value2']) & (x['Value'] == x['Value3'])

    df = pd.DataFrame('background-color: ', index=x.index, columns=x.columns)
    df['Id'] = np.where(m1 & m2, f'background-color: green', df['Id'])
    
    return df

>>> df_df1_df2_df3.style.apply(highlight_aligned_row, axis=None)

enter image description here

My questions are :

  1. How do we highlight the entire row when a condition is fulfilled ?
  2. Is there a more efficient way to merge 10 dataframes ?
  3. How can we check if every value/row of the original dataframe is aligned with the values of the final dataframe (after the merge) ?

Thank you in advance for your suggestions and your help !

Timeless
  • 22,580
  • 4
  • 12
  • 30
  • Hi, could you maybe clarify your third question: "How can we check if a value is aligned in every df ?" Not sure what you mean by this. – ouroboros1 Aug 20 '22 at 19:40
  • Hi @ouoboros1, I edited my question to be more precise. Check it out ! Thank you. – Timeless Aug 20 '22 at 19:51
  • So, you mean: how to ensure that e.g. `BB` for some `df` doesn't shift to row index `2` instead of index `1` with all the others? – ouroboros1 Aug 20 '22 at 19:53
  • `BB` in my example is the only Id that is aligned with all the other dataframes. It's not really an index issue. `BB` can have different indexes in df1, df2.. What matters is that `BB` exists in them, and has the right value. I hope it's more clear now for you. – Timeless Aug 20 '22 at 19:57
  • Not yet clear, I'm afraid. The fact that `BB` is aligned is proven by your `m1 & m2` evaluation. This will be a `pd.Series` with `[False, True, False, False, False]`. The `True` at index `1` means the value at that index is aligned. Or do you want to *return* that value as well in a different form, like a list or something? e.g. `aligned_values = ['BB']`. – ouroboros1 Aug 20 '22 at 20:01
  • I'm sorry. No I don't want to return a list. I would like to re-write my `m1 & m2` evaluation in a different form (mine is too long/ugly) and after that, I'm looking for a way to highlight the aligned rows with a green color. – Timeless Aug 20 '22 at 20:11

1 Answers1

1

I would do it like this. Hope the comments in between make clear what I am doing. Hopefully, they also answer your questions, but let me know if anything remains unclear.

import pandas as pd
import numpy as np

df = pd.DataFrame({'Id' : ['AA', 'BB', 'CC', 'DD', 'EE'],
                   'Value': ['three', 'two', 'five','four', 'one']})
df1 = pd.DataFrame({'Id1' : [np.nan, 'BB', 'CC', 'DD', np.nan],
                    'Value1' : ['one', 'two', 'four', np.nan, np.nan]})
df2 = pd.DataFrame({'Id2' : ['AA', 'BB', 'CC', 'DD', 'JJ'],
                   'Value2' : [np.nan, 'two', 'five', np.nan, 'six']})
df3 = pd.DataFrame({'Id3' : ['FF', 'BB', 'CC', 'GG', 'HH'],
                   'Value3' : ['seven', 'two', 'one','v4', 'v5']})

# *IF* your dfs (like above) are all same shape with same index, then easiest is to
# collect your dfs in a list and use pd.concat along axis 1 to merge:

# dfs = [df, df1, df2, df3]
# df_all = pd.concat(dfs, axis=1)

# *BUT* from your comments, this does not appear to be the case. Then instead, 
# again collect dfs in a list, and merge them with df in a loop

dfs = [df, df1, df2, df3]

for idx, list_df in enumerate(dfs):
    if idx == 0:
        df_all = list_df
    else:
        df_all = df_all.merge(list_df, left_on='Id', 
                              right_on=[col for col in list_df.columns 
                                        if col.startswith('Id')][0], 
                              how='left')
        
def highlight_aligned_row(x):

    n1 = x.loc[:,[col for col in x.columns 
                  if col.startswith('Id')]].eq(x.loc[:, 'Id'], axis=0).all(axis=1)
    m1 = x.loc[:,[col for col in x.columns 
                  if col.startswith('Value')]].eq(x.loc[:, 'Value'], axis=0).all(axis=1)
    
    eval_bool = n1 & m1
    # Just for x['Id']: [False, True, False, False, False]
    
    # repeat 8 times (== len(df.columns)) will lead to .shape == (40,). 
    # reshape to 5 rows (== len(df)) and 8 cols. Second row will be [8x True] now, 
    # other rows all 8x False
    rows, cols = len(eval_bool), len(x.columns) # 5, 8
    eval_bool_repeated = eval_bool.to_numpy().repeat(cols).reshape(rows,cols)
    
    # setup your df
    df = pd.DataFrame('background-color: ', index=x.index, columns=x.columns)
    
    # now apply eval_bool_repeated to entire df, not just df['Id']
    df = np.where(eval_bool_repeated, f'background-color: green', df)

    return df

Result:

ouroboros1
  • 9,113
  • 3
  • 7
  • 26
  • 1
    Thank you so much @ouroboros1. I have a question though. `pd.concat()` seems to be so powerful but what if my dataframes (df1, df2, ..) have different number of rows and columns ? Will your code still work ? Another thing, in that case, should I change manually this part of the code `.repeat(8).reshape(5,8)` ? – Timeless Aug 20 '22 at 20:58
  • 1
    I've updated the answer to turn `5, 8` into `rows, cols` variables, so that it becomes flexible. As for `pd.concat()`, yes, it's very powerful indeed. But it can also be tricky if you're not entirely sure about the shape (and index) of your different dfs. Can lead to unexpected behaviour. SO has quite a couple of excellent reads on `concat, merge, join`. Might want to check out https://stackoverflow.com/questions/53645882/pandas-merging-101 and https://stackoverflow.com/questions/49620538/what-are-the-levels-keys-and-names-arguments-for-in-pandas-concat-functio to learn more. – ouroboros1 Aug 20 '22 at 21:27
  • 1
    Thank you so much @ouroboros ! I'll check the posts you shared and try to manipulate `concat, merge, join` and see their behaviours. You've been very helpful. – Timeless Aug 20 '22 at 21:32