8

So I have this dataframe (as below), I am trying to join itself by copying it into another df. The join condition as below; Join condition:

  1. Same PERSONID and Badge_ID
  2. But different SITE_ID1
  3. Timedelta between the two rows should be less than 48 hrs.

Expecting

PERSONID    Badge_ID    Reader_ID1_x    SITE_ID1_x  EVENT_TS1_x         Reader_ID1_y    SITE_ID1_x  EVENT_TS1_y
2553-AMAGID 4229        141                 99          2/1/2016 3:26   145                 97          2/1/2016 3:29
2553-AMAGID 4229        248                 99          2/1/2016 3:26   145                 97          2/1/2016 3:29
2553-AMAGID 4229        145                 97          2/1/2016 3:29   251                 99          2/1/2016 3:29
2553-AMAGID 4229        145                 97          2/1/2016 3:29   291                 99          2/1/2016 3:29

Here is what I tired, Make a copy of df and then filter each df with this condition like below and then join them back again. But the below condition doesn't work :( I tried this filters in SQL before reading into df but that's too slow for 600k+ rows, event with indexes.

df1 = df1[(df1['Badge_ID']==df2['Badge_ID']) and (df1['SITE_ID1']!=df2['SITE_ID1']) and ((df1['EVENT_TS1']-df2['EVENT_TS1'])<=datetime.timedelta(hours=event_time_diff))]

PERSONID    Badge_ID    Reader_ID1  SITE_ID1              EVENT_TS1
2553-AMAGID     4229    141             99          2/1/2016 3:26:10 AM
2553-AMAGID     4229    248             99          2/1/2016 3:26:10 AM
2553-AMAGID     4229    145             97          2/1/2016 3:29:56 AM
2553-AMAGID     4229    251             99          2/1/2016 3:29:56 AM
2553-AMAGID     4229    291             99          2/1/2016 3:29:56 AM
2557-AMAGID     4219    144             99          2/1/2016 2:36:30 AM
2557-AMAGID     4219    144             99          2/1/2016 2:40:00 AM
2557-AMAGID     4219    250             99          2/1/2016 2:40:00 AM
2557-AMAGID     4219    290             99          2/1/2016 2:40:00 AM
2557-AMAGID     4219    144             97          2/1/2016 4:02:06 AM
2557-AMAGID     4219    250             99          2/1/2016 4:02:06 AM
2557-AMAGID     4219    290             99          2/1/2016 4:02:06 AM
2557-AMAGID     4219    250             97          2/2/2016 1:36:30 AM
2557-AMAGID     4219    290             99          2/3/2016 2:38:30 AM
2559-AMAGID     4227    141             99          2/1/2016 4:33:24 AM
2559-AMAGID     4227    248             99          2/1/2016 4:33:24 AM
2560-AMAGID     4226    141             99          2/1/2016 4:10:56 AM
2560-AMAGID     4226    248             99          2/1/2016 4:10:56 AM
2560-AMAGID     4226    145             99          2/1/2016 4:33:52 AM
2560-AMAGID     4226    251             99          2/1/2016 4:33:52 AM
2560-AMAGID     4226    291             99          2/1/2016 4:33:52 AM
2570-AMAGID     4261    141             99          2/1/2016 4:27:02 AM
2570-AMAGID     4261    248             99          2/1/2016 4:27:02 AM
2986-AMAGID     4658    145             99          2/1/2016 3:14:54 AM
2986-AMAGID     4658    251             99          2/1/2016 3:14:54 AM
2986-AMAGID     4658    291             99          2/1/2016 3:14:54 AM
2986-AMAGID     4658    144             99          2/1/2016 3:26:30 AM
2986-AMAGID     4658    250             99          2/1/2016 3:26:30 AM
2986-AMAGID     4658    290             99          2/1/2016 3:26:30 AM
4133-AMAGID     6263    142             99          2/1/2016 2:44:08 AM
4133-AMAGID     6263    249             99          2/1/2016 2:44:08 AM
4133-AMAGID     6263    141             34          2/1/2016 2:44:20 AM
4133-AMAGID     6263    248             34          2/1/2016 2:44:20 AM
4414-AMAGID     6684    145             99          2/1/2016 3:08:06 AM
4414-AMAGID     6684    251             99          2/1/2016 3:08:06 AM
4414-AMAGID     6684    291             99          2/1/2016 3:08:06 AM
4414-AMAGID     6684    145             22          2/1/2016 3:19:12 AM
4414-AMAGID     6684    251             22          2/1/2016 3:19:12 AM
4414-AMAGID     6684    291             22          2/1/2016 3:19:12 AM
4414-AMAGID     6684    145             99          2/1/2016 4:14:28 AM
4414-AMAGID     6684    251             99          2/1/2016 4:14:28 AM
4414-AMAGID     6684    291             99          2/1/2016 4:14:28 AM
4484-AMAGID     6837    142             99          2/1/2016 2:51:14 AM
4484-AMAGID     6837    249             99          2/1/2016 2:51:14 AM
4484-AMAGID     6837    141             99          2/1/2016 2:51:26 AM
4484-AMAGID     6837    248             99          2/1/2016 2:51:26 AM
4484-AMAGID     6837    141             99          2/1/2016 3:05:12 AM
4484-AMAGID     6837    248             99          2/1/2016 3:05:12 AM
4484-AMAGID     6837    141             99          2/1/2016 3:08:58 AM
4484-AMAGID     6837    248             99          2/1/2016 3:08:58 AM
PyRaider
  • 607
  • 4
  • 11
  • 21
  • What does you data look like to start with ? – piRSquared May 19 '17 at 23:55
  • The data shown is what I am working with, with columns (PERSONID Badge_ID Reader_ID1 SITE_ID1 EVENT_TS1) – PyRaider May 19 '17 at 23:56
  • Can you show a smaller sample with what you expect to get as a result. I've got a working example but I have no real way of knowing if it is what you want. You should include in that example a row that exceeds the 48 hour threshold and it not being in the result set. – piRSquared May 19 '17 at 23:58
  • I just added an example the match because they belong to the same PersonID and Badge_ID but different Site_ID and are within 48 hours. There could be more combination of it but just gave an example. – PyRaider May 20 '17 at 00:11
  • How is sql too slow over 600k rows? – Andrew L May 20 '17 at 00:39
  • 1
    @AndrewL 600k rows from one table with a cartesian product on the same table adds up a lot. – PyRaider May 22 '17 at 20:30
  • By definition a join is not a cartesian product otherwise you're doing it incorrectly. – Andrew L May 22 '17 at 20:59
  • I stumbled on this thread and am trying something similar in SQL. Only 260k rows on a self join with covering indices is taking forever, so I will use the example here to do this in pandas. – horcle_buzz Mar 08 '19 at 01:41

1 Answers1

15

Try the following:

# Transform data in first dataframe
df1 = pd.DataFrame(data)

# Save the data in another datframe
df2 = pd.DataFrame(data)

# Rename column names of second dataframe 
df2.rename(index=str, columns={'Reader_ID1': 'Reader_ID1_x', 'SITE_ID1': 'SITE_ID1_x', 'EVENT_TS1': 'EVENT_TS1_x'}, inplace=True)

# Merge the dataframes into another dataframe based on PERSONID and Badge_ID
df3 = pd.merge(df1, df2, how='outer', on=['PERSONID', 'Badge_ID'])

# Use df.loc() to fetch the data you want
df3.loc[(df3.Reader_ID1 < df3.Reader_ID1_x) & (df3.SITE_ID1 != df3.SITE_ID1_x) & (pd.to_datetime(df3['EVENT_TS1']) - pd.to_datetime(df3['EVENT_TS1_x'])<=datetime.timedelta(hours=event_time_diff))]
manojps
  • 311
  • 2
  • 7