
I am trying to concat 2 DataFrames, but .join is creating an unwanted duplicate.

df_ask:

timestamp      price      volume
1520259290     10.5       100
1520259275     10.6       2000
1520259275     10.55      200

df_bid:
timestamp      price      volume
1520259290     10.25      500
1520259280     10.2       300
1520259275     10.1       400

I tried:

depth = pd.concat([df_ask,df_bid], axis=1, keys=['Ask Orders','Bid Orders'])

but that raises an error which I do understand (`concat failed: Reindexing only valid with uniquely valued Index objects`)

and I tried:

df_ask.join(df_bid, how='outer', lsuffix='_ask', rsuffix='_bid')

Which gives no error, but gives the following dataframe:

timestamp      price_ask   volume_ask   price_bid   volume_bid
1520259290     10.5         100          10.25       500
1520259280      NaN         NaN          10.2        300
1520259275     10.6         2000         10.1        400
1520259275     10.55        200          10.1        400

My problem is the repeated 10.1 and 400 at timestamp 1520259275. That row appears only once in the original df_bid, so it should appear only once here. Having two rows for the same timestamp is correct, since there are two ask rows at that time, but only one of them should carry the bid information; the other should be NaN.

ie What I'm looking for is this:

timestamp      price_ask   volume_ask   price_bid   volume_bid
1520259290     10.5         100          10.25       500
1520259280      NaN         NaN          10.2        300
1520259275     10.6         2000         10.1        400
1520259275     10.55        200           NaN        NaN

I've looked through the merge/join/concat documentation and this question but I can't find what I'm looking for. Thanks in advance

ACF
    You are using `Time` as an index... and it isn't unique. I don't understand what you want. Please edit your post and show what you think you want. – piRSquared Mar 05 '18 at 15:01
  • @piRSquared Correct. `Time` is not going to be unique. By the way, I changed the index to `timestamp` to be less confusing. I've updated the question to show what I'm looking for. – ACF Mar 05 '18 at 15:08

1 Answer


You are implicitly assuming that the first instance of an index value in one frame should align with the first instance of that value in the other frame. To make that explicit, use `groupby` + `cumcount` to establish an ordering within each duplicated index value, and append it as a second index level.

df_ask = df_ask.set_index(df_ask.groupby('timestamp').cumcount(), append=True)
df_bid = df_bid.set_index(df_bid.groupby('timestamp').cumcount(), append=True)

df_ask.join(df_bid, how='outer', lsuffix='_ask', rsuffix='_bid')

              price_ask  volume_ask  price_bid  volume_bid
timestamp                                                 
1520259275 0      10.60      2000.0      10.10       400.0
           1      10.55       200.0        NaN         NaN
1520259280 0        NaN         NaN      10.20       300.0
1520259290 0      10.50       100.0      10.25       500.0
piRSquared
  • Yep! works, but does add a count column to the index which I was able to get rid of with: `df.reset_index(inplace=True)` `df.drop(df.columns[1], axis=1, inplace=True)` `df.set_index('timestamp',inplace=True)` `df.sort_index(axis=0, inplace=True)` – ACF Mar 05 '18 at 17:46
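Putting the answer and the cleanup from the comment together, here is a self-contained sketch (the `depth`, `ask`, and `bid` names are mine, and the data is reconstructed from the question). Instead of the reset/drop/set_index chain, the helper counter level can be dropped in one step with `reset_index(level=1, drop=True)`:

```python
import pandas as pd

# Reconstruct the sample frames from the question, indexed by timestamp
df_ask = pd.DataFrame(
    {"price": [10.5, 10.6, 10.55], "volume": [100, 2000, 200]},
    index=pd.Index([1520259290, 1520259275, 1520259275], name="timestamp"),
)
df_bid = pd.DataFrame(
    {"price": [10.25, 10.2, 10.1], "volume": [500, 300, 400]},
    index=pd.Index([1520259290, 1520259280, 1520259275], name="timestamp"),
)

# Append a per-timestamp counter as a second index level so that duplicate
# timestamps align pairwise instead of producing a cartesian product
ask = df_ask.set_index(df_ask.groupby("timestamp").cumcount(), append=True)
bid = df_bid.set_index(df_bid.groupby("timestamp").cumcount(), append=True)

depth = ask.join(bid, how="outer", lsuffix="_ask", rsuffix="_bid")

# Drop the helper counter level and restore a plain timestamp index
depth = depth.reset_index(level=1, drop=True).sort_index()
print(depth)
```

The second row at 1520259275 now has NaN in the bid columns, as wanted, and the counter never appears in the final frame.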