0

The given dataframe contains a host, a ip address and the timestamp the ip has requested that host.

import pandas as pd

data = {
    'host': [
        'google.com',
        'yahoo.com',
        'wikipedia.org',
        'stackoverflow.com',
        'amazon.com',
        'google.com',
        'yahoo.com',
        'wikipedia.org',
        'wikipedia.org',
        'stackoverflow.com',
        'amazon.com',
        'google.com'
    ],
    'ip': [
        '192.168.1.1',
        '192.168.1.1',
        '192.168.1.1',
        '192.168.1.2',
        '192.168.1.2',
        '192.168.1.1',
        '192.168.1.2',
        '192.168.1.1',
        '192.168.1.2',
        '192.168.1.1',
        '192.168.1.2',
        '192.168.1.1',
    ],
    'ts': [
        1603968729,
        1603968829,
        1603968889,
        1603968529,
        1603968810,
        1603968564,
        1603968443,
        1603968765,
        1603968811,
        1603968278,
        1603968149,
        1603968001
    ]
}

df = pd.DataFrame(data, columns = ['host', 'ip', 'ts'])

Now I can group the df to count the unique ip addresses for each host

x = df.groupby(['host','ip'])['ip'].count()

host               ip         
amazon.com         192.168.1.2    2
google.com         192.168.1.1    3
stackoverflow.com  192.168.1.1    1
                   192.168.1.2    1
wikipedia.org      192.168.1.1    2
                   192.168.1.2    1
yahoo.com          192.168.1.1    1
                   192.168.1.2    1

And I know how to get the latest timestamp of the unique ip

y = df.groupby(['host','ip'])['ts'].max()

amazon.com         192.168.1.2    1603968810
google.com         192.168.1.1    1603968729
stackoverflow.com  192.168.1.1    1603968278
                   192.168.1.2    1603968529
wikipedia.org      192.168.1.1    1603968889
                   192.168.1.2    1603968811
yahoo.com          192.168.1.1    1603968829
                   192.168.1.2    1603968443

But what I actually want to achive is the latest timestamp and the count of each ip. A table like:

amazon.com         192.168.1.2    1603968810    2
google.com         192.168.1.1    1603968729    3
stackoverflow.com  192.168.1.1    1603968278    1
                   192.168.1.2    1603968529    1
wikipedia.org      192.168.1.1    1603968889    2
                   192.168.1.2    1603968811    1
yahoo.com          192.168.1.1    1603968829    1
                   192.168.1.2    1603968443    1

Please may anyone provide the missing step.

Roger Sánchez
  • 87
  • 1
  • 1
  • 8

1 Answers1

1
df.groupby(by=['host', 'ip'])['ts'].agg(['max', 'count'])

You group by the two properties and call multiple aggregation functions using agg.

Tom Ron
  • 5,906
  • 3
  • 22
  • 38