
I have a dataframe with the following structure:

event_timestamp      message_number  an_robot     check
2015-04-15 12:09:39  10125            robot_7     False
2015-04-15 12:09:41  10053            robot_4     True
2015-04-15 12:09:44  10156_ad         robot_7     True
2015-04-15 12:09:47  20205            robot_108   False
2015-04-15 12:09:51  10010            robot_38    True
2015-04-15 12:09:54  10012            robot_65    True
2015-04-15 12:09:59  10011            robot_39    True
2015-04-15 12:10:01  87954            robot_2     False
......etc

The check column indicates whether or not the row should be merged, in this manner:

event timestamp: first
 message number: combine (e.g., 10053,10156)
       an_robot: combine (e.g., robot_4, robot_7)
          check: can be removed after the operation.

So far, I have succeeded in using groupby to get the correct values for the True and False groups of the check column:

df.groupby(by='check').agg({'event_timestamp': 'first',
                            'message_number': lambda x: ','.join(x),
                            'an_robot': lambda x: ','.join(x)}).reset_index()

which outputs:

     check    event_timestamp        message_number         an_robot
0    False    2015-04-15 12:09:39    10125,10053,..,87954   robot_7,robot_4, ... etc
1    True     2015-04-15 12:09:51    10010,10012            robot_38,robot_65

However, the end result would ideally be the following: the 10053 and 10156_ad rows are combined, and the 10010, 10012 and 10011 rows are combined. In the full dataframe, the maximum length of a sequence is 5. I have a separate dataframe with those rules (such as the 10010,10012,10011 rule).

event_timestamp      message_number        an_robot
2015-04-15 12:09:39  10125                 robot_7
2015-04-15 12:09:41  10053,10156_ad        robot_4,robot_7
2015-04-15 12:09:47  20205                 robot_108
2015-04-15 12:09:51  10010,10012,10011     robot_38,robot_65,robot_39
2015-04-15 12:10:01  87954                 robot_2

How could I achieve this?

--EDIT--

The dataset with the separate rules looks as follows:

sequence             support
10053,10156,20205    0.94783
10010,10012          0.93322
10010,10033          0.93211
10053,10032          0.92222
etc....

The code that determines whether check will be True or False for a row:

def find_drops(seq, df):
    if seq:
        m = np.logical_and.reduce([df.message_number.shift(-i).eq(seq[i]) for i in range(len(seq))])
        if len(seq) == 1:
            return pd.Series(m, index=df.index)
        else:
            return pd.Series(m, index=df.index).replace({False: np.NaN}).ffill(limit=len(seq)-1).fillna(False)
    else:
        return pd.Series(False, index=df.index)

If I then run df['check'] = find_drops(['10010', '10012', '10011'], df), I get a check column with True for these rows. It would be great if it were possible to run this for each row of the rules dataframe and then merge the rows with the code provided.
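A sketch of how that could look, assuming the rules live in a dataframe with a sequence column as in the edit above (the tiny frames here are illustrative): run find_drops once per rule and OR the resulting masks into a single check column.

```python
import numpy as np
import pandas as pd

def find_drops(seq, df):
    # True for every row that belongs to an occurrence of `seq`
    if seq:
        m = np.logical_and.reduce(
            [df.message_number.shift(-i).eq(seq[i]) for i in range(len(seq))])
        if len(seq) == 1:
            return pd.Series(m, index=df.index)
        return (pd.Series(m, index=df.index)
                  .replace({False: np.nan})
                  .ffill(limit=len(seq) - 1)
                  .fillna(False)
                  .astype(bool))
    return pd.Series(False, index=df.index)

df = pd.DataFrame({'message_number': ['10125', '10010', '10012', '10011', '87954']})

# illustrative rules frame, shaped like the sequence/support dataset above
rules = pd.DataFrame({'sequence': ['10010,10012,10011']})

# OR the per-rule masks together into one check column
df['check'] = False
for seq in rules['sequence'].str.split(','):
    df['check'] |= find_drops(seq, df)
```

Note this still leaves the follow-up problem from the question: a plain groupby on check merges all True rows of different sequences into one group, so a per-occurrence group key is still needed.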

--new code 4-17-2019--

df = """event_timestamp|message_number|an_robot
2015-04-15 12:09:39|10125|robot_7
2015-04-15 12:09:41|10053|robot_4
2015-04-15 12:09:44|10156_ad|robot_7
2015-04-15 12:09:47|20205|robot_108
2015-04-15 12:09:48|45689|robot_23
2015-04-15 12:09:51|10010|robot_38
2015-04-15 12:09:54|10012|robot_65
2015-04-15 12:09:58|98765|robot_99
2015-04-15 12:09:59|10011|robot_39
2015-04-15 12:10:01|87954|robot_2"""

df = pd.read_csv(io.StringIO(df), sep='|')

df1 = """sequence|support
10053,10156_ad,20205|0.94783
10010,10012|0.93322
10011,87954|0.92222
"""

df1 = pd.read_csv(io.StringIO(df1), sep='|')
patterns = df1['sequence'].str.split(',')

used_idx = []
c = ['event_timestamp','message_number','an_robot']
def find_drops(seq):
    if seq:
        m = np.logical_and.reduce([df.message_number.shift(-i).eq(seq[i]) for i in range(len(seq))])
        if len(seq) == 1:
            df2 = df.loc[m,  c].assign(g = df.index[m])
            used_idx.extend(df2.index.tolist())
            return df2
        else:
            m1 = (pd.Series(m, index=df.index).replace({False: np.NaN})
                                               .ffill(limit=len(seq)-1)
                                               .fillna(False))
            df2 = df.loc[m1,  c]
            used_idx.extend(df2.index.tolist())
            df2['g'] = np.where(df2.index.isin(df.index[m]), df2.index, np.nan)
            return df2


out = (pd.concat([find_drops(x) for x in patterns])
        .assign(g = lambda x: x['g'].ffill())
        .groupby(by=['g']).agg({'event_timestamp':'first',
                                 'message_number':','.join, 
                                 'an_robot':','.join})
        .reset_index(drop=True))

c = ['event_timestamp','message_number','an_robot']
df2 = df[~df.index.isin(used_idx)]
df2 = pd.DataFrame([[df2['event_timestamp'].iat[0], 
                    ','.join(df2['message_number']),
                    ','.join(df2['an_robot'])]], columns=c)

fin = pd.concat([out, df2], ignore_index=True)
fin.event_timestamp = pd.to_datetime(fin.event_timestamp)
fin = fin.sort_values('event_timestamp')
fin

The output is:

event_timestamp      message_number           an_robot
2015-04-15 12:09:39  10125,45689,98765,12345  robot_7,robot_23,robot_99
2015-04-15 12:09:41  10053,10156_ad,20205     robot_4,robot_7,robot_108
2015-04-15 12:09:51  10010,10012              robot_38,robot_65
2015-04-15 12:09:59  10011,87954              robot_39,robot_2

It should be:

event_timestamp      message_number        an_robot
2015-04-15 12:09:39  10125                 robot_7
2015-04-15 12:09:41  10053,10156_ad,20205  robot_4,robot_7,robot_108
2015-04-15 12:09:48  45689                 robot_23
2015-04-15 12:09:51  10010,10012           robot_38,robot_65
2015-04-15 12:09:58  98765                 robot_99
2015-04-15 12:09:59  10011,87954           robot_39,robot_2
2015-04-15 12:10:03  12345                 robot_1
intStdu

2 Answers


You could classify the message numbers before you group them. It would be best to keep these classification rules in a dataframe, with one classification per message number.

class_df = pd.DataFrame(data={'message_number': ['10010', '10012', '10011', '10053', '10156_ad'],
                              'class': ['a', 'a', 'a', 'b', 'b']})

You can then merge them:

results = pd.merge(df, class_df, on=['message_number'], how='left')

Then you can group by class and check:

results.groupby(by=['check', 'class']).agg({'event_timestamp': 'first',
                                            'message_number': lambda x: ','.join(x),
                                            'an_robot': lambda x: ','.join(x)}).reset_index()
tvgriek

The question is more complex than it first appeared, so I have completely changed the answer.

The first step is preprocessing - filter to only the values that exist in the sequences, using Series.isin and boolean indexing:

patterns = df1['sequence'].str.split(',')
print (patterns)

#flatten lists to sets
flatten = set([y for x in patterns for y in x])
#print (flatten)

df1 = df[df['message_number'].isin(flatten)]
#print (df1)
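To see the filtering step in isolation, a small standalone check (the values are illustrative): rows whose message number appears in no sequence are dropped, so the shift-based matching later is not disturbed by interleaved rows.

```python
import pandas as pd

df = pd.DataFrame({'message_number': ['10125', '10053', '10156_ad', '45689', '20205']})
patterns = pd.Series(['10053,10156_ad,20205', '10010,10012']).str.split(',')

# flatten the per-pattern lists into one set of known codes
flatten = set(y for x in patterns for y in x)

# keep only rows whose code appears in some sequence
df1 = df[df['message_number'].isin(flatten)]
```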

The first solution is a modified version of this answer - groupby is added for sequences with length > 1, the function is called for each pattern, and the results are joined together by concat:

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    c = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
    return c

used_idx = []

def agg_pattern(seq):
    if seq:
        N = len(seq)
        arr = df1['message_number'].values
        b = np.all(rolling_window(arr, N) == seq, axis=1)
        c = np.mgrid[0:len(b)][b]

        d = [i  for x in c for i in range(x, x+N)]
        used_idx.extend(df1.index.values[d])
        m = np.in1d(np.arange(len(arr)), d)

        di = {'event_timestamp':'first','message_number':','.join, 'an_robot':','.join}

        if len(seq) == 1:
            return df1.loc[m, ['event_timestamp','message_number','an_robot']]
        else:
            df2 = df1[m]
            return df2.groupby(np.arange(len(df2)) // N).agg(di)


out = pd.concat([agg_pattern(x) for x in patterns], ignore_index=True)
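To see what the rolling_window helper and the matching step do in isolation, a quick standalone check (the array values are illustrative): each row of the strided view is one window of the input, so the whole array can be compared against a sequence at once.

```python
import numpy as np

def rolling_window(a, window):
    # strided view of `a`: row i is a[i:i+window]
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

arr = np.array(['10053', '10156_ad', '20205', '10010'])
windows = rolling_window(arr, 2)

# rows where the window equals the target sequence
hits = np.all(windows == np.array(['10053', '10156_ad']), axis=1)
```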

Your solution should be changed to create a helper column g, used for grouping in the last step:

used_idx = []
c = ['event_timestamp','message_number','an_robot']
def find_drops(seq):
    if seq:
        m = np.logical_and.reduce([df1.message_number.shift(-i).eq(seq[i]) for i in range(len(seq))])
        if len(seq) == 1:
            df2 = df1.loc[m,  c].assign(g = df1.index[m])
            used_idx.extend(df2.index.tolist())
            return df2
        else:
            m1 = (pd.Series(m, index=df1.index).replace({False: np.NaN})
                                               .ffill(limit=len(seq)-1)
                                               .fillna(False))
            df2 = df1.loc[m1,  c]
            used_idx.extend(df2.index.tolist())
            df2['g'] = np.where(df2.index.isin(df1.index[m]), df2.index, np.nan)
            return df2


out = (pd.concat([find_drops(x) for x in patterns])
        .assign(g = lambda x: x['g'].ffill())
        .groupby(by=['g']).agg({'event_timestamp':'first',
                                 'message_number':','.join, 
                                 'an_robot':','.join})
        .reset_index(drop=True))

print (used_idx)

Last, create a new DataFrame from the unmatched (False) rows and join it to the output:

print (out)
       event_timestamp        message_number                   an_robot
0  2015-04-15 12:09:41  10053,10156_ad,20205  robot_4,robot_7,robot_108
1  2015-04-15 12:09:51           10010,10012          robot_38,robot_65
2  2015-04-15 12:09:59           10011,87954           robot_39,robot_2

c = ['event_timestamp','message_number','an_robot']
df2 = pd.concat([out, df[~df.index.isin(used_idx)]]).sort_values('event_timestamp')
print(df2)
       event_timestamp        message_number                   an_robot
0  2015-04-15 12:09:39                 10125                    robot_7
0  2015-04-15 12:09:41  10053,10156_ad,20205  robot_4,robot_7,robot_108
4  2015-04-15 12:09:48                 45689                   robot_23
1  2015-04-15 12:09:51           10010,10012          robot_38,robot_65
7  2015-04-15 12:09:58                 98765                   robot_99
2  2015-04-15 12:09:59           10011,87954           robot_39,robot_2
jezrael
  • wow this works great indeed! i have also added the look of the dataframe with the rules. How could i convert these rules to the rules variable that you made? Also, would it still work for rules that have common codes (rule 3 in the dataframe has 10010 in it as well)? – intStdu Apr 16 '19 at 12:41
  • 1
    oh wait, before you answer it. i will upload the code that determines wether or not check is true or false, maybe we can combine the codes. – intStdu Apr 16 '19 at 12:52
  • both codes work great! How should these two solutions be altered to a version where the rows that are not in patterns/sequences df are also in the end dataframe, like in my expected output? Because now the end dataframe is only the rows that contain joined messagenumbers and robots. – intStdu Apr 17 '19 at 10:00
  • i tried using the new code, however i get the error that Dataframe object has no attribute message_number. I changed the line in the find_drops function in the beginning to df.message_number, but that gave me the error that: wrong number of items passed 18, placement implies 4. – intStdu Apr 17 '19 at 13:38
  • is it okay if i message you to avoid making the discussion to extended? i got it to work, however: all the items that were not observed in the sequence dataframe are now appended to each other in one row – intStdu Apr 17 '19 at 14:38
  • i have added the final code used now, also example with output + expected output! As can be seen, all the items that are not in the sequence are added to the first row. those message_number values should all be on a separate row – intStdu Apr 17 '19 at 15:10
  • 1
    @intStdu - Edited answer. – jezrael Apr 18 '19 at 07:44
  • I noticed something else. Sometimes, more than 1 sequence get's added to each other. I think that happens because they appear after each other. Could that be prevented by first giving this algorithm the longest sequences, then the ones of length 4, then 3, then 2? – intStdu Apr 19 '19 at 09:15
  • @intStdu - Do you think sort nested lists by length, like [this](https://stackoverflow.com/q/30346356/2901002) ? – jezrael Apr 19 '19 at 11:49