I have a dataframe with the following structure:
event_timestamp message_number an_robot check
2015-04-15 12:09:39 10125 robot_7 False
2015-04-15 12:09:41 10053 robot_4 True
2015-04-15 12:09:44 10156_ad robot_7 True
2015-04-15 12:09:47 20205 robot_108 False
2015-04-15 12:09:51 10010 robot_38 True
2015-04-15 12:09:54 10012 robot_65 True
2015-04-15 12:09:59 10011 robot_39 True
2015-04-15 12:10:01 87954 robot_2 False
......etc
The check column indicates whether or not a row should be merged, in this manner:
event_timestamp: first
message_number: combine (e.g., 10053,10156)
an_robot: combine (e.g., robot_4,robot_7)
check: can be removed after the operation.
So far, I have succeeded in using groupby to get the correct values for the True and False groups in the check column:
df.groupby(by='check').agg({'event_timestamp': 'first',
                            'message_number': lambda x: ','.join(x),
                            'an_robot': lambda x: ','.join(x)}).reset_index()
which outputs:
check event_timestamp message_number an_robot
0 False 2015-04-15 12:09:39 10125,10053,..,87954 robot_7,robot_4, ... etc
1 True 2015-04-15 12:09:51 10010,10012 robot_38,robot_65
However, the end result should ideally look like the following, where the 10053 and 10156_ad rows are combined, and the 10010, 10012 and 10011 rows are combined. In the full dataframe, the maximum length of a sequence is 5. I have a separate dataframe with those rules (like the 10010,10012,10011 rule).
event_timestamp message_number an_robot
2015-04-15 12:09:39 10125 robot_7
2015-04-15 12:09:41 10053,10156_ad robot_4,robot_7
2015-04-15 12:09:47 20205 robot_108
2015-04-15 12:09:51 10010,10012,10011 robot_38,robot_65,robot_39
2015-04-15 12:10:01 87954 robot_2
How could I achieve this?
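For the sample above, one option is to merge only consecutive runs of True rows by building a group key that increments whenever the check value changes (and on every False row), then grouping on that key. This is only a sketch: it ignores the rules dataframe entirely and just happens to produce the desired grouping for this sample, because here every run of consecutive True rows is exactly one rule sequence.

```python
import io

import pandas as pd

data = """event_timestamp|message_number|an_robot|check
2015-04-15 12:09:39|10125|robot_7|False
2015-04-15 12:09:41|10053|robot_4|True
2015-04-15 12:09:44|10156_ad|robot_7|True
2015-04-15 12:09:47|20205|robot_108|False
2015-04-15 12:09:51|10010|robot_38|True
2015-04-15 12:09:54|10012|robot_65|True
2015-04-15 12:09:59|10011|robot_39|True
2015-04-15 12:10:01|87954|robot_2|False"""
df = pd.read_csv(io.StringIO(data), sep='|')

# A new group starts whenever `check` differs from the previous row;
# additionally, every False row is forced into its own group.
grp = (df['check'].ne(df['check'].shift()) | ~df['check']).cumsum()

out = (df.groupby(grp)
         .agg({'event_timestamp': 'first',
               'message_number': ','.join,
               'an_robot': ','.join})
         .reset_index(drop=True))
print(out)
```

This yields five rows, with 10053,10156_ad merged and 10010,10012,10011 merged, matching the desired table above.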
--EDIT--
The dataset with the separate rules looks as follows:
sequence support
10053,10156,20205 0.94783
10010,10012 0.93322
10010,10033 0.93211
10053,10032 0.92222
etc....
The code that determines whether a row in check will be True or False:
def find_drops(seq, df):
    if seq:
        m = np.logical_and.reduce([df.message_number.shift(-i).eq(seq[i])
                                   for i in range(len(seq))])
        if len(seq) == 1:
            return pd.Series(m, index=df.index)
        else:
            return (pd.Series(m, index=df.index)
                      .replace({False: np.nan})
                      .ffill(limit=len(seq) - 1)
                      .fillna(False))
    else:
        return pd.Series(False, index=df.index)
If I then run df['check'] = find_drops(['10010', '10012', '10011'], df),
I get a check column with True for these rows. It would be great if it were possible to run this for each row in the rules dataframe and then merge the rows with the code provided above.
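One way to run find_drops for every rule and combine the results into a single check column is to OR the per-rule boolean masks together. A sketch, using the find_drops function from above on a small sample (the message_number column is read as strings so .eq(seq[i]) compares like with like):

```python
import io

import numpy as np
import pandas as pd

def find_drops(seq, df):
    # Mark the row where the sequence starts, then forward-fill so that
    # every row belonging to the match is flagged True.
    if seq:
        m = np.logical_and.reduce([df.message_number.shift(-i).eq(seq[i])
                                   for i in range(len(seq))])
        if len(seq) == 1:
            return pd.Series(m, index=df.index)
        return (pd.Series(m, index=df.index)
                  .replace({False: np.nan})
                  .ffill(limit=len(seq) - 1)
                  .fillna(False))
    return pd.Series(False, index=df.index)

df = pd.read_csv(io.StringIO("""event_timestamp|message_number|an_robot
2015-04-15 12:09:51|10010|robot_38
2015-04-15 12:09:54|10012|robot_65
2015-04-15 12:09:59|10011|robot_39
2015-04-15 12:10:01|87954|robot_2"""),
                 sep='|', dtype={'message_number': str})

rules = pd.Series(['10010,10012,10011'])  # one rule per row, as in df1['sequence']
patterns = rules.str.split(',')

# OR the per-rule masks together into a single check column
df['check'] = np.logical_or.reduce([find_drops(p, df) for p in patterns])
print(df)
```

With the single rule 10010,10012,10011, the first three rows end up True and the 87954 row stays False.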
--new code 4-17-2019--
import io

import numpy as np
import pandas as pd

df = """event_timestamp|message_number|an_robot
2015-04-15 12:09:39|10125|robot_7
2015-04-15 12:09:41|10053|robot_4
2015-04-15 12:09:44|10156_ad|robot_7
2015-04-15 12:09:47|20205|robot_108
2015-04-15 12:09:48|45689|robot_23
2015-04-15 12:09:51|10010|robot_38
2015-04-15 12:09:54|10012|robot_65
2015-04-15 12:09:58|98765|robot_99
2015-04-15 12:09:59|10011|robot_39
2015-04-15 12:10:01|87954|robot_2"""
df = pd.read_csv(io.StringIO(df), sep='|')

df1 = """sequence|support
10053,10156_ad,20205|0.94783
10010,10012|0.93322
10011,87954|0.92222
"""
df1 = pd.read_csv(io.StringIO(df1), sep='|')

patterns = df1['sequence'].str.split(',')
used_idx = []
c = ['event_timestamp', 'message_number', 'an_robot']

def find_drops(seq):
    if seq:
        m = np.logical_and.reduce([df.message_number.shift(-i).eq(seq[i])
                                   for i in range(len(seq))])
        if len(seq) == 1:
            df2 = df.loc[m, c].assign(g=df.index[m])
            used_idx.extend(df2.index.tolist())
            return df2
        else:
            m1 = (pd.Series(m, index=df.index)
                    .replace({False: np.nan})
                    .ffill(limit=len(seq) - 1)
                    .fillna(False))
            df2 = df.loc[m1, c]
            used_idx.extend(df2.index.tolist())
            df2['g'] = np.where(df2.index.isin(df.index[m]), df2.index, np.nan)
            return df2

out = (pd.concat([find_drops(x) for x in patterns])
         .assign(g=lambda x: x['g'].ffill())
         .groupby(by=['g'])
         .agg({'event_timestamp': 'first',
               'message_number': ','.join,
               'an_robot': ','.join})
         .reset_index(drop=True))

df2 = df[~df.index.isin(used_idx)]
df2 = pd.DataFrame([[df2['event_timestamp'].iat[0],
                     ','.join(df2['message_number']),
                     ','.join(df2['an_robot'])]], columns=c)
fin = pd.concat([out, df2], ignore_index=True)
fin.event_timestamp = pd.to_datetime(fin.event_timestamp)
fin = fin.sort_values('event_timestamp')
fin
output is:
event_timestamp message_number an_robot
2015-04-15 12:09:39 10125,45689,98765,12345 robot_7,robot_23,robot_99
2015-04-15 12:09:41 10053,10156_ad,20205 robot_4,robot_7,robot_108
2015-04-15 12:09:51 10010,10012 robot_38,robot_65
2015-04-15 12:09:59 10011,87954 robot_39,robot_2
but it should be:
event_timestamp message_number an_robot
2015-04-15 12:09:39 10125 robot_7
2015-04-15 12:09:41 10053,10156_ad,20205 robot_4,robot_7,robot_108
2015-04-15 12:09:48 45689 robot_23
2015-04-15 12:09:51 10010,10012 robot_38,robot_65
2015-04-15 12:09:58 98765 robot_99
2015-04-15 12:09:59 10011,87954 robot_39,robot_2
2015-04-15 12:10:03 12345 robot_1
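The difference comes from the last block: every row whose index is not in used_idx gets collapsed into a single output row. Keeping each unmatched row as its own row before concatenating may be closer to the expected result. A minimal sketch with stand-in values for out and used_idx (the real values would come from the code above):

```python
import io

import pandas as pd

c = ['event_timestamp', 'message_number', 'an_robot']

# Stand-ins for the state produced by the code above (assumed values).
df = pd.read_csv(io.StringIO("""event_timestamp|message_number|an_robot
2015-04-15 12:09:39|10125|robot_7
2015-04-15 12:09:48|45689|robot_23
2015-04-15 12:09:51|10010|robot_38
2015-04-15 12:09:54|10012|robot_65"""),
                 sep='|', dtype={'message_number': str})
used_idx = [2, 3]  # indices consumed by the matched sequence 10010,10012
out = pd.DataFrame([['2015-04-15 12:09:51', '10010,10012', 'robot_38,robot_65']],
                   columns=c)

# Keep each unmatched row as its own output row instead of joining them all.
rest = df.loc[~df.index.isin(used_idx), c]
fin = (pd.concat([out, rest], ignore_index=True)
         .assign(event_timestamp=lambda x: pd.to_datetime(x['event_timestamp']))
         .sort_values('event_timestamp')
         .reset_index(drop=True))
print(fin)
```

Here the unmatched 10125 and 45689 rows survive as individual rows, interleaved by timestamp with the merged sequence row.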