
I have an HTTP request log. The included fields are: capture_time, ip, method, url, content, user_agent

All this information is in a csv file.

I want to group all requests from the same IP within a 10-minute interval.

How can I do that using pandas?

example dataset:

  • date: 2019-04-24 23:16:48.742466
  • ip: 187.20.211.99
  • method: POST
  • url: /delivery/check_location
  • content: bairro=Vila&cidade=Lima
  • agent: Mozilla/5.0 (iPhone; CPU iPhone OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML like Gecko) Mobile/15E148

I have already tried the groupby method.

I would like to merge all the requests' content into one row (for those grouped by IP and time).

Lucca Zenobio

2 Answers

df.set_index('date', inplace = True)

unnesting(df.resample('10T')['ip'].unique().reset_index(), ['ip']).reset_index(drop = True)

First you need to set your date column as the index. Next, resample in 10-minute increments, look at your IP column, and take the unique IPs for each time span. Then unnest the lists created by unique() using the function below.

##https://stackoverflow.com/questions/53218931/how-to-unnest-explode-a-column-in-a-pandas-dataframe/55839330#55839330

import numpy as np
import pandas as pd

def unnesting(df, explode):
    # Repeat each index label once per element of the list-valued column
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx

    # Re-attach the non-exploded columns
    return df1.join(df.drop(explode, axis=1), how='left')

After this you can concatenate whatever you were planning.
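To make the resample-and-unnest step concrete, here is a minimal, self-contained sketch on a made-up three-row log (the data is hypothetical, not from the question). On pandas >= 0.25 the built-in explode() does the same job as the unnesting() helper:

```python
import pandas as pd

# Hypothetical mini log: two requests from one IP in the same
# 10-minute window, plus one request from another IP later on
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2019-04-24 23:16:48",
        "2019-04-24 23:18:02",
        "2019-04-24 23:31:10",
    ]),
    "ip": ["187.20.211.99", "187.20.211.99", "10.0.0.1"],
}).set_index("date")

# One row per (10-minute bucket, unique ip); empty buckets produce
# NaN after explode() and are dropped
out = (df.resample("10min")["ip"]
         .unique()
         .reset_index()
         .explode("ip")
         .dropna(subset=["ip"])
         .reset_index(drop=True))
print(out)
```

The two requests at 23:16 and 23:18 collapse into a single row for the 23:10 bucket, while the 23:31 request lands in the 23:30 bucket.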

EDIT:

import numpy as np
import pandas as pd

# Set the date column as the index
df.set_index('date', inplace=True)

# 10 minutes in nanoseconds
ns10min = 10 * 60 * 1000000000

# Floor each timestamp down to its 10-minute bucket
df.index = pd.to_datetime((df.index.astype(np.int64) // ns10min) * ns10min)

# Group by both the bucketed index and ip, then look at the first row
df.groupby([df.index, df['ip']]).first()
Ben Pap
  • seems to work great. i just dont know how to concatenate the other fields, as it creates a new dataframe with the ip and the time interval. any ideas? – Lucca Zenobio May 16 '19 at 00:17
  • glad to see you are using unnest :-) – BENY May 16 '19 at 00:29
  • @WeNYoBen yup! I read through that thread pretty thoroughly to understand all the pro's/con's of different methods, and your method is pretty handy :D – Ben Pap May 16 '19 at 05:27
  • @LuccaZenobio Soo this is why I was asking you questions before hand, if an IP address appears twice in 10 min with different other columns, you can't concat it so it remains a single row. Unless you want your DF to be really wide with columns repeating? – Ben Pap May 16 '19 at 05:30
  • i want to join all the other columns content in one. just like it is with your tip but with one more column with all the values concatenated. if i could get at least the indexes i can do that – Lucca Zenobio May 16 '19 at 13:29

I used Ben Pap's method to group the IPs based on the date. After that I got a dataframe containing the IPs and the time intervals. To join the other columns and add them to this dataframe, I did this:

content = []
for index, row in test.iterrows():
    texto = ""
    # Rows of df2 with the same ip whose timestamps fall inside
    # this row's 10-minute window
    resul = df2.loc[(df2[df2.columns[1]] == row[2])
                    & (row[0] < df2.index)
                    & (df2.index < row[0] + pd.Timedelta(minutes=10))]
    # Concatenate the method, url and content fields of every match
    for _, current_row in resul.iterrows():
        texto += " " + current_row.values[2] + " " + current_row.values[3] + " " + current_row.values[4]
    content.append(texto)
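The loop above can also be collapsed into a single vectorized groupby: bucket the timestamps with floor('10min') and join the text column per (bucket, ip) group. A minimal sketch on made-up data (column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical log: two requests from one IP in the same 10-minute
# window, one request from another IP in a later window
df2 = pd.DataFrame({
    "date": pd.to_datetime(["2019-04-24 23:16:48",
                            "2019-04-24 23:18:02",
                            "2019-04-24 23:31:10"]),
    "ip": ["187.20.211.99", "187.20.211.99", "10.0.0.1"],
    "content": ["a=1", "b=2", "c=3"],
})

# One row per (10-minute bucket, ip), with the content values joined
grouped = (df2
           .groupby([df2["date"].dt.floor("10min"), "ip"])["content"]
           .agg(" ".join)
           .reset_index())
print(grouped)
```

The two requests in the 23:10 bucket merge into a single "a=1 b=2" row, which is the concatenated-content output the question asks for.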
Lucca Zenobio