
I am using pandas to analyse ssh sessions to different nodes. To that end I have parsed the ssh daemon log, and I have a DataFrame that contains the following columns:

  • Node: the name of the node where the connection was established
  • Session: the ID of the session
  • Start: timestamp indicating when the connection started
  • Finish: timestamp indicating when the connection ended

Here's a part of the data:

In [375]: sessions[1:10]                                 
Out[375]: 
    Node  Session               Start              Finish
1  svg01    27321 2015-02-23 07:24:45 2015-02-23 07:50:57
2  svg02    14171 2015-02-23 10:25:08 2015-02-23 14:33:24
3  svg02    14273 2015-02-23 10:26:21 2015-02-23 14:36:19
4  svg01    14401 2015-02-23 10:28:16 2015-02-23 14:38:04
5  svg01    26408 2015-02-23 14:01:49 2015-02-23 18:38:25
6  svg03    13722 2015-02-23 18:24:39 2015-02-23 20:51:59
7  svg05    17637 2015-02-23 19:10:00 2015-02-23 19:10:20

I want to generate an additional column with the number of sessions already established on a given node at the moment each new connection is established.

Without taking the Node into account, I can compute this using:

# Sessions open at time t: started before t and still running (finished after t)
count_sessions = lambda t: sessions[(sessions.Start < t) & (sessions.Finish > t)].shape[0]
sessions['OpenSessions'] = sessions['Start'].map(count_sessions)

The problem is that I also need to take the 'Node' column value into account, but I do not know how to get at it from inside the mapped function.

I could use the index of the element in the Series to look up the node in the sessions DataFrame, but I have not found any way to retrieve the index of the element passed to map.
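
For reference, one way such an index-aware loop could look: Series.items() yields (index, value) pairs, so each row's Node is reachable (a sketch, not a tested solution):

# Sketch: iterate (index, value) pairs so that each row's Node is reachable.
open_counts = [
    ((sessions.Start < t) & (sessions.Finish > t)
     & (sessions.Node == sessions.at[i, 'Node'])).sum()
    for i, t in sessions['Start'].items()
]
sessions['OpenSessions'] = open_counts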

alcachi
  • Could you provide a CSV? It's a lot easier to fiddle around with your data if one can just parse it directly into a DataFrame ... – filmor Mar 03 '15 at 10:23
  • Here is the dataset: https://www.dropbox.com/s/a8q73338nz2okvy/sessions.csv – alcachi Mar 03 '15 at 11:31
  • This method is O(n**2); if `sessions` has many rows, it will be slow. You can speed it up to O(n*log(n)) if you use sort & cumsum. – HYRY Mar 03 '15 at 13:13
  • @HYRY Good insight, and definitely something to take into account when moving to full logs. I think I understand the sort part, but how can I afterwards use cumsum to achieve the same result? – alcachi Mar 03 '15 at 21:58
  • Merge start and finish timestamps and create another column that has value 1 for start timestamps and -1 for finish timestamps. Then sort timestamps and cumsum the values. – HYRY Mar 03 '15 at 22:34
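
A minimal sketch of the sort-and-cumsum approach HYRY describes, assuming the sessions DataFrame from the question (the column names Time, Delta and Open are mine):

import pandas as pd

# Each start is a +1 event and each finish a -1 event; a per-node
# cumulative sum over the time-sorted events yields the open-session count.
starts = pd.DataFrame({'Node': sessions['Node'], 'Time': sessions['Start'], 'Delta': 1})
ends = pd.DataFrame({'Node': sessions['Node'], 'Time': sessions['Finish'], 'Delta': -1})
events = pd.concat([starts, ends])

# Sorting Delta ascending places a finish before a start at equal timestamps,
# mirroring the strict inequalities of the O(n**2) version. (Two sessions
# starting at exactly the same instant would still count each other, though.)
events = events.sort_values(['Node', 'Time', 'Delta'])
events['Open'] = events.groupby('Node')['Delta'].cumsum()

# Start rows kept their original index, so this aligns back to 'sessions';
# 'Open' includes the session itself, hence the -1.
sessions['OpenSessions'] = events.loc[events['Delta'] == 1, 'Open'] - 1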

2 Answers

def count(df):
    # Inside groupby-apply, df is the sub-frame for a single node, so the
    # filter below only ever sees sessions on that node.
    count_sessions = lambda t: df[(df.Start < t) & (df.Finish > t)].shape[0]
    df['OpenSessions'] = df['Start'].map(count_sessions)
    return df

print(sessions.groupby('Node').apply(count))

The output is:

    Node  Session                Start               Finish  OpenSessions
0  svg01    27321  2015-02-23 07:24:45  2015-02-23 07:50:57             0
1  svg02    14171  2015-02-23 10:25:08  2015-02-23 14:33:24             0
2  svg02    14273  2015-02-23 10:26:21  2015-02-23 14:36:19             1
3  svg01    14401  2015-02-23 10:28:16  2015-02-23 14:38:04             0
4  svg01    26408  2015-02-23 14:01:49  2015-02-23 18:38:25             1
5  svg03    13722  2015-02-23 18:24:39  2015-02-23 20:51:59             0
6  svg05    17637  2015-02-23 19:10:00  2015-02-23 19:10:20             0

Read this for inspiration.
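
The same idea can also be written without mutating the group inside apply, returning only the counts and assigning them back by index; a sketch (the function name open_at_start is mine):

def open_at_start(df):
    # For each start time t, count sessions in this node's group that
    # were open at t (started before, finished after).
    return df['Start'].map(lambda t: ((df.Start < t) & (df.Finish > t)).sum())

sessions['OpenSessions'] = sessions.groupby('Node', group_keys=False).apply(open_at_start)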


Just a suggestion about another way to proceed. I am not sure about the exact criteria, but you should be able to adapt this easily:

sessions['OpenSessions'] = sessions.apply(
    lambda row: len(sessions[(sessions['Start'] < row['Start']) &
                             (sessions['Finish'] > row['Start']) &
                             (sessions['Node'] == row['Node'])]), axis=1)

For each row (argument axis=1), it simply counts the rows of the DataFrame that match whatever criteria you need, based on the current row's values. Note that, like the map in the question, this scans the whole frame once per row, so it is O(n**2); for large logs the sort-and-cumsum approach from the comments scales better.

etna