
I have an HTTP request log. The included fields are: capture_time, ip, method, url, content, user_agent

All this information is in a csv file.

I want to group all requests from the same IP within a 10-minute interval.

How can I do that using pandas?

example dataset:

  • date: 2019-04-24 23:16:48.742466
  • ip: 187.20.211.99
  • method: POST
  • url: /delivery/check_location
  • content: bairro=Vila&cidade=Lima
  • agent: Mozilla/5.0 (iPhone; CPU iPhone OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML like Gecko) Mobile/15E148

I have already tried the groupby method.

I would like to merge all the requests' content into one row (for those grouped by IP and time).

Lucca Zenobio

2 Answers

df.set_index('date', inplace = True)

unnesting(df.resample('10T')['ip'].unique().reset_index(), ['ip']).reset_index(drop = True)

First you need to set your date column as the index. Next, resample in 10-minute increments, look at your IP column, and take the unique IPs for each time span. Then unnest the lists created by unique() using the function below.

##https://stackoverflow.com/questions/53218931/how-to-unnest-explode-a-column-in-a-pandas-dataframe/55839330#55839330

import numpy as np
import pandas as pd

def unnesting(df, explode):
    # Repeat each index label once per element of the list-valued column
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx

    # Re-attach the non-exploded columns
    return df1.join(df.drop(explode, axis=1), how='left')

After this you can concatenate whatever you were planning.
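To make the resample-and-unnest step concrete, here is a minimal, self-contained sketch on a made-up three-row log (the data is hypothetical, not from the question). On pandas >= 0.25 the built-in explode() does the same job as the unnesting() helper:

```python
import pandas as pd

# Hypothetical mini log: two requests from one IP in the same
# 10-minute window, plus one request from another IP later on
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2019-04-24 23:16:48",
        "2019-04-24 23:18:02",
        "2019-04-24 23:31:10",
    ]),
    "ip": ["187.20.211.99", "187.20.211.99", "10.0.0.1"],
}).set_index("date")

# One row per (10-minute bucket, unique ip); empty buckets produce
# NaN after explode() and are dropped
out = (df.resample("10min")["ip"]
         .unique()
         .reset_index()
         .explode("ip")
         .dropna(subset=["ip"])
         .reset_index(drop=True))
print(out)
```

The two requests at 23:16 and 23:18 collapse into a single row for the 23:10 bucket, while the 23:31 request lands in the 23:30 bucket.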

EDIT:

import numpy as np
import pandas as pd

# Set the date column as the index
df.set_index('date', inplace=True)

# 10 minutes in nanoseconds
ns10min = 10 * 60 * 1000000000

# Floor each timestamp down to its 10-minute bucket
df.index = pd.to_datetime((df.index.astype(np.int64) // ns10min) * ns10min)

# Group by both the bucketed index and ip, then look at the first row
df.groupby([df.index, df['ip']]).first()
Ben Pap
  • seems to work great. i just dont know how to concatenate the other fields, as it creates a new dataframe with the ip and the time interval. any ideas? – Lucca Zenobio May 16 '19 at 00:17
  • glad to see you are using unnest :-) – BENY May 16 '19 at 00:29
  • @WeNYoBen yup! I read through that thread pretty thoroughly to understand all the pro's/con's of different methods, and your method is pretty handy :D – Ben Pap May 16 '19 at 05:27
  • @LuccaZenobio Soo this is why I was asking you questions before hand, if an IP address appears twice in 10 min with different other columns, you can't concat it so it remains a single row. Unless you want your DF to be really wide with columns repeating? – Ben Pap May 16 '19 at 05:30
  • i want to join all the other columns content in one. just like it is with your tip but with one more column with all the values concatenated. if i could get at least the indexes i can do that – Lucca Zenobio May 16 '19 at 13:29

I used Ben Pap's method to group the IPs based on the date. After that I got a dataframe containing the IPs and the time intervals. To join the other columns and add them to this dataframe, I did this:

content = []
for index, row in test.iterrows():
    texto = ""
    # Rows of df2 with the same ip whose timestamps fall inside
    # this row's 10-minute window
    resul = df2.loc[(df2[df2.columns[1]] == row[2])
                    & (row[0] < df2.index)
                    & (df2.index < row[0] + pd.Timedelta(minutes=10))]
    # Concatenate the method, url and content fields of every match
    for _, current_row in resul.iterrows():
        texto += " " + current_row.values[2] + " " + current_row.values[3] + " " + current_row.values[4]
    content.append(texto)
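The loop above can also be collapsed into a single vectorized groupby: bucket the timestamps with floor('10min') and join the text column per (bucket, ip) group. A minimal sketch on made-up data (column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical log: two requests from one IP in the same 10-minute
# window, one request from another IP in a later window
df2 = pd.DataFrame({
    "date": pd.to_datetime(["2019-04-24 23:16:48",
                            "2019-04-24 23:18:02",
                            "2019-04-24 23:31:10"]),
    "ip": ["187.20.211.99", "187.20.211.99", "10.0.0.1"],
    "content": ["a=1", "b=2", "c=3"],
})

# One row per (10-minute bucket, ip), with the content values joined
grouped = (df2
           .groupby([df2["date"].dt.floor("10min"), "ip"])["content"]
           .agg(" ".join)
           .reset_index())
print(grouped)
```

The two requests in the 23:10 bucket merge into a single "a=1 b=2" row, which is the concatenated-content output the question asks for.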
Lucca Zenobio