
I have an old dataframe (dfo) to which someone added extra columns (notes); this data set does not have a key column. I also have a new dataframe (dfn) that is supposed to represent the same data but does not have the notes columns. I was asked to transfer the old notes to the new dataframe. I have been able to get matches for some rows but not all. What I want to know is whether there are additional tricks for merging on multiple columns, or whether there are alternatives that might fit better.

Below is example data from the original CSV that did not merge; oddly, when I place it in the dictionaries below it merges just fine.

example_new = {'S1': ['W', 'CD', 'W', 'W', 'CD', 'W', 'CD'], 
'DateTime': ['6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26'], 
'N1': ['N', 'Y', 'N', 'Y', 'N', 'N', 'N'], 
'AC': ['C253', '100', '1064', '1920', '1996', '100', 'C253'], 
'PS': ['C_041', 'C_041', 'C_041', 'C_041', 'C_041', 'C_041', 'C_041'], 
'TI': ['14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP'], 
'U': [' ', 'N', 'U/C', 'T', 'C', 'N', 'P'], 
'LN': ['Eddie', 'Eddie', 'Eddie', 'Eddie', 'Eddie', 'Eddie', 'Eddie'], 
'R2': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]}

example_old = {'S1': ['W', 'W', 'W', 'W', 'CD', 'CD'], 
'DateTime': ['6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26'],
'N1': ['N', 'Y', 'N', 'N', 'N', 'Y'], 
'AC': ['1064', '1920', 'C253', '100', 'C253', '100'], 
'PS': ['C_041', 'C_041', 'C_041', 'C_041', 'C_041', 'C_041'], 
'TI': ['14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP'], 
'U': ['U/C', 'T', ' ', 'N', 'P', 'N'], 
'LN': ['Eddie', 'Eddie', 'Eddie', 'Eddie', 'Eddie', 'Eddie'], 
'R2': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan], 
'Note1': ['Y', 'Y', 'Y', 'Y', 'N', 'N']}

dfo = pd.DataFrame.from_dict(example_old)
dfn = pd.DataFrame.from_dict(example_new)
dfn['DateTime'] = pd.to_datetime(dfn['DateTime'])
dfo['DateTime'] = pd.to_datetime(dfo['DateTime'])

The code:

dfo = dfo # shape (10250, 10) the additional columns are notes. 
# columns: ['S1', 'DateTime', 'N1', 'AC', 'PS', 'TI', 'U', 'LN', 'R2', 'Note1']

dfn = dfn # shape (13790, 9) there are a lot of corrections to the prior data
# and additional new data.
# columns: ['S1', 'DateTime', 'N1', 'AC', 'PS', 'TI', 'U', 'LN', 'R2']

# to make sure that the dtypes are the same.
# I read that making sure the object columns are all strings works better. Really Good tip!!
str_col = ['S1', 'N1', 'AC', 'PS', 'TI', 'U', 'LN', 'R2']
dfo[str_col] = dfo[str_col].astype(str)
dfn[str_col] = dfn[str_col].astype(str)
dfo = dfo.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
dfn = dfn.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
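Beyond stripping outer whitespace, collapsing internal whitespace runs and unifying case can rescue near-matches that look identical by eye. A minimal sketch (the small frame and values here are made up to illustrate; apply the same idea to the real key columns):

```python
import pandas as pd

# hypothetical values: visually equal keys that differ in spacing/case
df = pd.DataFrame({'AC': ['C253 ', ' c 253', '100'],
                   'LN': ['Eddie', 'EDDIE', 'eddie']})

def normalize(s: pd.Series) -> pd.Series:
    # strip, remove all internal whitespace, and lowercase so that
    # 'C253 ' and ' c 253' compare equal as join keys
    return (s.astype(str)
             .str.strip()
             .str.replace(r'\s+', '', regex=True)
             .str.lower())

df[['AC', 'LN']] = df[['AC', 'LN']].apply(normalize)
print(df['AC'].tolist())   # ['c253', 'c253', '100']
print(df['LN'].nunique())  # 1
```

If case is significant in your data, drop the `.str.lower()` step and keep only the whitespace normalization.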

# I read that encoding the columns might show characters that are hidden.
# I did not find this helpful for my data. 
# u = dfo.select_dtypes(include=[object])
# dfo[u.columns] = u.apply(lambda x: x.str.encode('utf-8'))
# u = dfn.select_dtypes(include=[object])
# dfn[u.columns] = u.apply(lambda x: x.str.encode('utf-8'))

# test / check the dtypes
otypes = dfo.columns.to_series().groupby(dfo.dtypes).groups
ntypes = dfn.columns.to_series().groupby(dfn.dtypes).groups

# display results... dtypes

In [74]: print(otypes)
{datetime64[ns]: ['DateTime'],
 object: ['S1', 'N1', 'AC', 'PS', 'TI', 'U', 'LN', 'R2', 'Note1']}

In [82]: print(ntypes)
{datetime64[ns]: ['DateTime'],
 object: ['S1', 'N1', 'AC', 'PS', 'TI', 'U', 'LN', 'R2']}

# Time to merge 
subset = ['S1', 'DateTime', 'N1', 'AC', 'PS', 'TI', 'U', 'LN', 'R2']
dfm = pd.merge(dfn,dfo, how="left", on=subset)

About 75% of the data is merging. I have done spot checks and there is a lot more data that could merge but is not. What else should I do to get the remaining ~25% to merge? If you want to see the data in the csv files I have included a link. Github to csv files

Shane S
  • Well, first step would be to check the rows that are not merging, what do they have in common with the dataframe you are trying to get it to merge to? Or conversely, what is causing them to not match? Then you can go from there. For example, if they're not matching because one join key is uppercase and another is lowercase, you can make them both lowercase. – Liew Xun Sep 13 '22 at 06:01
  • @LiewXun Thanks Did look at this and found nothing. It appears to match perfectly. – Shane S Sep 13 '22 at 06:04
  • 1
    Another tip, when asking questions related to pandas, refer to this https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples You don't have to share the actual data, but some sample "fake" data would be good enough to ask the question (generic column names, fake values etc. ) – Liew Xun Sep 13 '22 at 06:05
  • Would you mind build a [mcve]. Mainly we need a sample of both dataset with rows that merge and don't in a format that is loadable such as CSV. That would definitely help to solve your issue. – jlandercy Sep 13 '22 at 06:07
  • @jlandercy you make a good point. I will need to work on that. So I will have to get back to this question later with that. – Shane S Sep 13 '22 at 06:09

2 Answers


I suggest doing an outer merge of both DataFrames with the indicator option set, to record where each row originated. Afterwards you can visualize the data interactively with plotly, or you can group on a subset of columns and explore the differences.

Get the DataFrames: Github to csv files

import pandas as pd
import os

path = "~/Downloads/merging_dataframes-main"
dfo = pd.read_csv(os.path.expanduser(os.path.join(path, "df_old.csv")))
dfn = pd.read_csv(os.path.expanduser(os.path.join(path, "df_new.csv")))

dfn['DateTime'] = pd.to_datetime(dfn['DateTime'])
dfo['DateTime'] = pd.to_datetime(dfo['DateTime'])

Do an outer merge of both DataFrames

df_merge = pd.merge(dfo, dfn, how='outer', indicator=True)

columns_of_interest = df_merge.columns.difference(['DateTime', '_merge'])
df_merge.sort_values(['DateTime', *columns_of_interest], ignore_index=True, inplace=True)

mask_matches = df_merge._merge == 'both'

# # in case you want to rename the indicator
# df_merge._merge = df_merge._merge.cat.rename_categories({'left_only': 'old', 'right_only': 'new'})
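A quick way to quantify how much actually matched is to count the indicator column. A small sketch with toy frames standing in for dfo/dfn:

```python
import pandas as pd

left = pd.DataFrame({'k': [1, 2, 3]})
right = pd.DataFrame({'k': [2, 3, 4]})

merged = pd.merge(left, right, how='outer', indicator=True)
# value_counts shows how many rows matched ('both') versus
# came from only one of the two frames
print(merged['_merge'].value_counts())
```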

Option 1: Visualize the merge

Run the code in a Jupyter notebook and zoom in on points of interest. The y-axis shows the indexes. Hover with your mouse over a point to see all parameters of that point.

import plotly.express as px

px.scatter(df_merge[~mask_matches], 
           x='DateTime', 
           color='_merge',
           hover_data=df_merge[~mask_matches].columns.difference(['DateTime', '_merge']))

interactive plotly window

Option 2: Use groupby to get the difference

groupby searches for unique combinations of the defined columns. Here I use only 'DateTime', as it makes the most sense for this data, but you can also group on multiple columns. Afterwards, I create one DataFrame out of all groups which have both types listed in _merge, i.e. left_only and right_only.

gb = df_merge[~mask_matches].groupby('DateTime')
groups = []
i = 0
for group_i in gb.groups:
    df_i = gb.get_group(group_i).copy()
    if df_i['_merge'].nunique() != 1:
        df_i['group'] = i
        i += 1
        groups.append(df_i)

df_same_time = pd.concat(groups)
|    | S1 | DateTime            | N1 | AC   | PS    | TI       | U   | LN    | R2  | Note1 | _merge     | group |
|----|----|---------------------|----|------|-------|----------|-----|-------|-----|-------|------------|-------|
| 40 | CD | 2018-02-01 11:37:00 | N  | 1005 | C_031 | 15-85-SR | U/L | Eaton | nan | Y     | left_only  | 0     |
| 41 | W  | 2018-02-01 11:37:00 | N  | 1930 | C_031 | 15-85-SR | M/L | Eaton | nan | nan   | right_only | 0     |
| 46 | CD | 2018-02-01 15:15:00 | N  | 1005 | C_007 | 15-82-SR | U/L | Eaton | nan | Y     | left_only  | 1     |
| 47 | W  | 2018-02-01 15:15:00 | N  | 1930 | C_007 | 15-82-SR | M/L | Eaton | nan | nan   | right_only | 1     |

...
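The loop can also be written more compactly with `GroupBy.filter`, which keeps only groups containing both indicator values. A sketch with a toy frame mimicking the `_merge` column:

```python
import pandas as pd

df = pd.DataFrame({
    'DateTime': ['t1', 't1', 't2', 't3', 't3'],
    '_merge':   ['left_only', 'right_only', 'left_only',
                 'left_only', 'right_only'],
})

# keep only DateTime groups where both sides are represented
df_same_time = (df.groupby('DateTime')
                  .filter(lambda g: g['_merge'].nunique() > 1))
print(df_same_time['DateTime'].unique())  # ['t1' 't3']
```

Note this drops the explicit `group` counter; add it back with `groupby(...).ngroup()` if you need it.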

or explore the differences

for group_i in gb.groups:
    df_i = gb.get_group(group_i)
    if len(df_i._merge.unique()) != 1:
        print('\n', df_i["DateTime"].iloc[0])
        print(f'_merge\t: {df_i["_merge"].values.astype(str)}')
        for i in df_i.columns.difference(['Note1', '_merge']):
            if len(df_i[i].unique()) > 1:
                print(f'{i}\t: {df_i[i].values}')

where you see this entry quite a lot

_merge  : ['left_only' 'right_only']
AC      : ['1005' '1930']
S1      : ['CD' 'W']
U       : ['U/L' 'M/L']

and/or visualize it with plotly again

px.scatter(df_same_time, 
           x='DateTime', 
           color='_merge',
           hover_data=df_same_time.columns.difference(['DateTime', '_merge']))
Andrew

If I understand correctly, the '25% that do not merge' refers to the length of the data frame that results from an inner join:

print('inner', pd.merge(dfn, dfo, how="inner", on=subset).shape)  # (7445, 10)

Which means that roughly 2,800 rows of the old data frame are not merged with the new data frame.

It seems to me your code works fine. It is just that these ‘missing’ rows in the old data frame do not have a counterpart in the new data frame.
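This can be confirmed directly with a left merge from the old frame and the `indicator` flag. A minimal sketch with toy frames in place of dfo/dfn:

```python
import pandas as pd

subset = ['a', 'b']
dfo = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'y', 'z'],
                    'Note1': ['Y', 'N', 'Y']})
dfn = pd.DataFrame({'a': [1, 3], 'b': ['x', 'z']})

chk = pd.merge(dfo, dfn, how='left', on=subset, indicator=True)
# rows flagged 'left_only' are old rows with no counterpart in dfn
n_unmatched = (chk['_merge'] == 'left_only').sum()
print(n_unmatched)  # 2
```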

If you create a data frame of data that did not merge this can be checked:

dfi = pd.merge(dfn, dfo, how="inner", on=subset)
dfi['inner'] = 1
df_o_nm = pd.merge(dfo, dfi, how="left", on=subset)
df_o_nm = df_o_nm.loc[df_o_nm['inner'] != 1][subset]  # not merged data from old data frame

Take for example the first row in df_o_nm:

S1 DateTime            N1    AC     PS        TI    U     LN   R2
CD 2018-02-01 11:37:00  N  1005  C_031  15-85-SR  U/L  Eaton  nan

If only a subset of columns is compared (specifically S1, N1, AC, PS, TI, U), exactly one candidate row can be found in dfn, which has a different DateTime and LN value:

S1 DateTime             N1    AC     PS        TI    U   LN   R2
CD 2020-07-01 12:59:00  N    1005  C_031  15-85-SR  U/L  Bob  nan

Differences in the 2 data frames seem to be mostly caused by the columns DateTime, LN, PS and TI. So a more thorough (still simple) comparison can be done like this:

for index, row in df_o_nm.iterrows():
    df_sel = dfn.loc[
        (dfn['S1'] == row['S1']) &
        (dfn['N1'] == row['N1']) &
        (dfn['AC'] == row['AC']) &
        (dfn['U'] == row['U'])]
    if len(df_sel) == 0:
        print('no matching data subset found.')
    else:
        print(f'{len(df_sel)} rows matching subset of columns found')
        for idx, row_sel in df_sel.iterrows():
            for col in ['DateTime', 'LN', 'PS', 'TI']:
                if row[col] != row_sel[col]:
                    print(f"{idx}:   {col}: {row[col]} --> {row_sel[col]}")
            print('---')  # separator per candidate row, not per column
rosa b.
  • thank you. There is a typo here df_oi should be changed to df_o_nm, if understand this correctly. – Shane S Sep 19 '22 at 19:45