
I have two pandas dataframes.

The first ("left") dataframe is read from a CSV file and looks like this:

                        open      high       low     close        volume
symbol  date                                                            
BTCUSDT 2017-08-17   4261.48   4485.39   4200.74   4285.08    795.150377
        2017-08-18   4285.08   4371.52   3938.77   4108.37   1199.888264
        2017-08-19   4108.37   4184.69   3850.00   4139.98    381.309763
        2017-08-20   4120.98   4211.08   4032.62   4086.29    467.083022
        2017-08-21   4069.13   4119.62   3911.79   4016.00    691.743060
...                      ...       ...       ...       ...           ...
        2021-11-10  66947.67  69000.00  62822.90  64882.43  65171.504050
        2021-11-11  64882.42  65600.07  64100.00  64774.26  37237.980580
        2021-11-12  64774.25  65450.70  62278.00  64122.23  44490.108160
        2021-11-13  64122.22  65000.00  63360.22  64380.00  22504.973830

The second ("right") dataframe is the latest data from the Binance API and looks like this:

                              open            high             low           close          volume
symbol  date                                                                                      
BTCUSDT 2021-11-11  64882.42000000  65600.07000000  64100.00000000  64774.26000000  37237.98058000
        2021-11-12  64774.25000000  65450.70000000  62278.00000000  64122.23000000  44490.10816000
        2021-11-13  64122.22000000  65000.00000000  63360.22000000  64380.00000000  22504.97383000
        2021-11-14  64380.01000000  65550.51000000  63576.27000000  65519.10000000  25705.07347000
        2021-11-15  65519.11000000  66401.82000000  63400.00000000  63606.74000000  37829.37124000
        2021-11-16  63606.73000000  63617.31000000  60293.06000000  60800.05000000  32920.17820000

I know I can get the max date from the first dataframe and use it to limit the data requested from the Binance API. But the two datasets could still overlap, and it's critical that there not be any duplicate rows in the final output CSV file.

I want to concat/merge/join (not sure which) the two dataframes, discarding any overlapping data as determined by the index, with priority given to the data in the first/left dataframe.

I'm familiar with SQL - this would be a left outer join on the index columns. But I'm having trouble with the correct syntax for pandas dataframes.
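The "keep the left row wherever the indexes overlap" behavior described above can be sketched in pandas as a vertical `concat` followed by index de-duplication. This is a minimal sketch with toy values on the question's `(symbol, date)` MultiIndex, not the full OHLCV frames:

```python
import pandas as pd

# Toy frames mimicking the (symbol, date) MultiIndex. The 2021-11-13 row
# exists in both frames; the left copy must win, so the right copy uses a
# deliberately wrong sentinel value (99999.99) to prove it is discarded.
left = pd.DataFrame(
    {"close": [64380.00]},
    index=pd.MultiIndex.from_tuples([("BTCUSDT", "2021-11-13")],
                                    names=["symbol", "date"]))
right = pd.DataFrame(
    {"close": [99999.99, 65519.10]},
    index=pd.MultiIndex.from_tuples([("BTCUSDT", "2021-11-13"),
                                     ("BTCUSDT", "2021-11-14")],
                                    names=["symbol", "date"]))

# Stack vertically, then drop later copies of any duplicated index label;
# keep="first" is what gives the left frame priority.
combined = pd.concat([left, right])
combined = combined[~combined.index.duplicated(keep="first")].sort_index()
```

Because `left` is listed first in the `concat`, `keep="first"` retains its rows wherever the index collides.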

The final dataframe should be:

                        open      high       low     close        volume
symbol  date                                                            
BTCUSDT 2017-08-17   4261.48   4485.39   4200.74   4285.08    795.150377
        2017-08-18   4285.08   4371.52   3938.77   4108.37   1199.888264
        2017-08-19   4108.37   4184.69   3850.00   4139.98    381.309763
        2017-08-20   4120.98   4211.08   4032.62   4086.29    467.083022
        2017-08-21   4069.13   4119.62   3911.79   4016.00    691.743060
...                      ...       ...       ...       ...           ...
        2021-11-10  66947.67  69000.00  62822.90  64882.43  65171.504050
        2021-11-11  64882.42  65600.07  64100.00  64774.26  37237.980580
        2021-11-12  64774.25  65450.70  62278.00  64122.23  44490.108160
        2021-11-13  64122.22  65000.00  63360.22  64380.00  22504.973830
        2021-11-14  64380.01  65550.51  63576.27  65519.10  25705.073470
        2021-11-15  65519.11  66401.82  63400.00  63606.74  37829.371240
        2021-11-16  63606.73  63617.31  60293.06  60800.05  32920.178200
Scott

2 Answers


This code should answer your question:

import pandas as pd

# Normalize both date columns to midnight-only datetimes so they compare equal.
df_left['date'] = pd.to_datetime(df_left['date']).dt.normalize()
df_right['date'] = pd.to_datetime(df_right['date']).dt.normalize()

final_df = pd.merge(df_left, df_right, how='left', on='date', indicator=True)
final_df = final_df[final_df['_merge'] == 'left_only']

or you can replace the last part with a one-liner:

final_df = pd.merge(df_left, df_right, on='date', how='outer', indicator=True).query('_merge == "left_only"')
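To go from the `indicator` trick to the final combined frame the question asks for, the filter has to run the other way: tag the right rows, keep only those whose key is absent from the left frame, and append them. A sketch with toy values, with the question's `(symbol, date)` index flattened to columns since `pd.merge` with `on=` operates on columns:

```python
import pandas as pd

# Toy frames; the right copy of 2021-11-13 carries a wrong sentinel value
# (99999.99) to prove the left copy takes priority.
left = pd.DataFrame({"symbol": ["BTCUSDT"], "date": ["2021-11-13"],
                     "close": [64380.00]})
right = pd.DataFrame({"symbol": ["BTCUSDT", "BTCUSDT"],
                      "date": ["2021-11-13", "2021-11-14"],
                      "close": [99999.99, 65519.10]})

# Tag each right row by whether its key also exists in left...
tagged = pd.merge(right, left[["symbol", "date"]],
                  on=["symbol", "date"], how="left", indicator=True)
new_rows = tagged[tagged["_merge"] == "left_only"].drop(columns="_merge")

# ...then append only the genuinely new rows, so left keeps priority.
final_df = pd.concat([left, new_rows], ignore_index=True)
```

Here `left_only` means "present only in the first argument of the merge", which is `right` in this call; those are exactly the rows safe to append.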

Yashar Ahmadov
  • I get this error due to the index columns: `final_df=pd.merge(df1, df2, how='left', on='date', indicator=True)` raises `ValueError: You are trying to merge on datetime64[ns] and object columns. If you wish to proceed you should use pd.concat`. I'd prefer not to have to drop and recreate the index if the two dataframes can be concatenated with the index(es) in place. – Scott Nov 16 '21 at 10:27
  • Do these at the beginning and then run the code above: `df_left=df_left.dropna()`,`df_right=df_right.dropna()` – Yashar Ahmadov Nov 16 '21 at 10:36
  • In any case make sure that you don't have some null or non-date strings in your date columns. Please let me know the outcome. – Yashar Ahmadov Nov 16 '21 at 11:43

I read the pandas documentation on concat, merge, and join, as well as various blogs.

This blog was very helpful: https://www.kite.com/blog/python/pandas-merge-join-concat/. In summary, it points to the use of concat since I am trying to append two dataframes vertically.

I tried multiple variations of concat, merge, and join, but finally settled on this approach: Pandas/Python: How to concatenate two dataframes without duplicates?

import pandas as pd

def append_non_duplicates(a, b, col=None):
    """Append rows of b to a, skipping rows whose key already exists in a."""
    if ((a is not None and not isinstance(a, pd.DataFrame))
            or (b is not None and not isinstance(b, pd.DataFrame))):
        raise ValueError('a and b must be of type pandas.DataFrame.')
    if a is None:
        return b
    if b is None:
        return a
    # Key on a positional column if one is given, otherwise on the index.
    if col is not None:
        aind = a.iloc[:, col].values
        bind = b.iloc[:, col].values
    else:
        aind = a.index.values
        bind = b.index.values
    # Boolean mask selecting the b rows whose key does not appear in a.
    take_rows = list(set(bind) - set(aind))
    take_rows = [i in take_rows for i in bind]
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
    return pd.concat([a, b.iloc[take_rows, :]])

Like Daniel, I too am surprised there isn't an easier way to do this in out-of-the-box pandas.
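For the index-keyed case the helper above handles, there is a fairly compact built-in route: take from `b` only the index labels that `a` lacks, via `Index.difference`, then `concat`. A sketch with toy single-level data (the same idiom works on a MultiIndex):

```python
import pandas as pd

# Toy frames; the overlapping 2021-11-13 row in b carries a wrong sentinel
# value (99999.99) to prove a's copy is the one that survives.
a = pd.DataFrame({"close": [64380.00]},
                 index=pd.Index(["2021-11-13"], name="date"))
b = pd.DataFrame({"close": [99999.99, 65519.10]},
                 index=pd.Index(["2021-11-13", "2021-11-14"], name="date"))

# Index.difference returns the labels present in b's index but not in a's,
# so only genuinely new rows are appended.
result = pd.concat([a, b.loc[b.index.difference(a.index)]]).sort_index()
```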

Scott