I have an old dataframe (dfo) to which someone added additional columns (notes); the data set has no key column. I also have a new dataframe (dfn) that is supposed to represent the same data but does not have the notes columns. I was asked to transfer the old notes to the new dataframe. I have been able to get matches for some rows but not all. What I want to know is whether there are additional tricks to merging on multiple columns, or whether there are alternatives that might fit better.
Below is example data from the original csv that did not merge; when I place the same rows in the dictionaries below, they merge just fine.
import numpy as np
import pandas as pd

example_new = {'S1': ['W', 'CD', 'W', 'W', 'CD', 'W', 'CD'],
'DateTime': ['6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26'],
'N1': ['N', 'Y', 'N', 'Y', 'N', 'N', 'N'],
'AC': ['C253', '100', '1064', '1920', '1996', '100', 'C253'],
'PS': ['C_041', 'C_041', 'C_041', 'C_041', 'C_041', 'C_041', 'C_041'],
'TI': ['14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP'],
'U': [' ', 'N', 'U/C', 'T', 'C', 'N', 'P'],
'LN': ['Eddie', 'Eddie', 'Eddie', 'Eddie', 'Eddie', 'Eddie', 'Eddie'],
'R2': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]}
example_old = {'S1': ['W', 'W', 'W', 'W', 'CD', 'CD'],
'DateTime': ['6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26'],
'N1': ['N', 'Y', 'N', 'N', 'N', 'Y'],
'AC': ['1064', '1920', 'C253', '100', 'C253', '100'],
'PS': ['C_041', 'C_041', 'C_041', 'C_041', 'C_041', 'C_041'],
'TI': ['14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP'],
'U': ['U/C', 'T', ' ', 'N', 'P', 'N'],
'LN': ['Eddie', 'Eddie', 'Eddie', 'Eddie', 'Eddie', 'Eddie'],
'R2': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
'Note1': ['Y', 'Y', 'Y', 'Y', 'N', 'N']}
dfo = pd.DataFrame.from_dict(example_old)
dfn = pd.DataFrame.from_dict(example_new)
dfn['DateTime'] = pd.to_datetime(dfn['DateTime'])
dfo['DateTime'] = pd.to_datetime(dfo['DateTime'])
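To show what I mean by "it works just fine", here is a trimmed-down sketch (only four of the columns, which are enough to make each example row unique) that left-merges the two example frames:

```python
import pandas as pd

# Trimmed-down versions of the dictionaries above (a few columns only),
# just to confirm the example rows line up when merged.
example_new = {'S1': ['W', 'CD', 'W', 'W', 'CD', 'W', 'CD'],
               'N1': ['N', 'Y', 'N', 'Y', 'N', 'N', 'N'],
               'AC': ['C253', '100', '1064', '1920', '1996', '100', 'C253'],
               'U': [' ', 'N', 'U/C', 'T', 'C', 'N', 'P']}
example_old = {'S1': ['W', 'W', 'W', 'W', 'CD', 'CD'],
               'N1': ['N', 'Y', 'N', 'N', 'N', 'Y'],
               'AC': ['1064', '1920', 'C253', '100', 'C253', '100'],
               'U': ['U/C', 'T', ' ', 'N', 'P', 'N'],
               'Note1': ['Y', 'Y', 'Y', 'Y', 'N', 'N']}

dfn_ex = pd.DataFrame(example_new)
dfo_ex = pd.DataFrame(example_old)

# Left-merge on the shared columns: all six old rows find a home, and only
# the one genuinely new row (AC '1996') is left without a note.
dfm_ex = pd.merge(dfn_ex, dfo_ex, how='left', on=['S1', 'N1', 'AC', 'U'])
print(dfm_ex['Note1'].isna().sum())
```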
The code:
dfo = dfo # shape (10250, 10) the additional columns are notes.
# columns: ['S1', 'DateTime', 'N1', 'AC', 'PS', 'TI', 'U', 'LN', 'R2', 'Note1']
dfn = dfn # shape (13790, 9) there are a lot of corrections to the prior data
# and additional new data.
# columns: ['S1', 'DateTime', 'N1', 'AC', 'PS', 'TI', 'U', 'LN', 'R2']
# To make sure that the dtypes are the same:
# I read that making sure the object columns are all strings works better. Really good tip!
str_col = ['S1', 'N1', 'AC', 'PS', 'TI', 'U', 'LN', 'R2']
dfo[str_col] = dfo[str_col].astype(str)
dfn[str_col] = dfn[str_col].astype(str)
dfo = dfo.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
dfn = dfn.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
# I read that encoding the columns might show characters that are hidden.
# I did not find this helpful for my data.
# u = dfo.select_dtypes(include=[object])
# dfo[u.columns] = u.apply(lambda x: x.str.encode('utf-8'))
# u = dfn.select_dtypes(include=[object])
# dfn[u.columns] = u.apply(lambda x: x.str.encode('utf-8'))
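Instead of encoding, a cheaper check (a sketch with made-up values, not my real data) is to print `repr()` of the unique values; repr makes non-breaking spaces and stray whitespace visible where `print()` renders them invisibly:

```python
import pandas as pd

# Hypothetical values: one entry hides a non-breaking space (\xa0),
# another has a trailing ordinary space.
s = pd.Series(['14-2-EP', '14-2\xa0-EP', '14-2-EP '])

# repr() exposes hidden characters that print() would not show.
for v in s.unique():
    print(repr(v))

# str.strip() only trims the ends, so an interior \xa0 survives it;
# dropping non-breaking spaces explicitly makes all three values equal.
cleaned = s.str.replace('\xa0', '', regex=False).str.strip()
```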
# test / check the dtypes
otypes = dfo.columns.to_series().groupby(dfo.dtypes).groups
ntypes = dfn.columns.to_series().groupby(dfn.dtypes).groups
# display results... dtypes
In [95]: print(otypes)
{datetime64[ns]: ['DateTime'],
 object: ['S1', 'N1', 'AC', 'PS', 'TI', 'U', 'LN', 'R2', 'Note1']}
In [82]: print(ntypes)
{datetime64[ns]: ['DateTime'],
 object: ['S1', 'N1', 'AC', 'PS', 'TI', 'U', 'LN', 'R2']}
# Time to merge
subset = ['S1', 'DateTime', 'N1', 'AC', 'PS', 'TI', 'U', 'LN', 'R2']
dfm = pd.merge(dfn, dfo, how="left", on=subset)
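To diagnose the misses, `pd.merge` accepts `indicator=True`, which adds a `_merge` column labeling each row `both`, `left_only`, or `right_only`. A minimal sketch with tiny stand-in frames (made-up values, not my real data):

```python
import pandas as pd

# Small stand-in frames (hypothetical values) to show the indicator trick.
dfn_demo = pd.DataFrame({'S1': ['W', 'W', 'CD'],
                         'AC': ['100', '1064', '1996']})
dfo_demo = pd.DataFrame({'S1': ['W', 'W'],
                         'AC': ['100', '1064'],
                         'Note1': ['Y', 'N']})

dfm_demo = pd.merge(dfn_demo, dfo_demo, how='left',
                    on=['S1', 'AC'], indicator=True)

# Rows labeled 'left_only' failed to pick up a note; inspecting their
# values (e.g. with repr) usually reveals why they missed.
unmatched = dfm_demo[dfm_demo['_merge'] == 'left_only']
print(unmatched[['S1', 'AC']])
```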
About 75% of the data is merging. I have done spot checks, and there is a lot more data that could merge but is not. What else should I do to get the remaining ~25% to merge? If you want to see the data in the csv files, I have included a link: Github to csv files