
I have a combined packet capture from both sides of a network connection in Wireshark. The capture is exported as a CSV file, and every row contains, among other things, a unique ID and a timestamp. Because I capture from both sides, I get two rows for every ID: one with the send timestamp and one with the receive timestamp. What I want to do is calculate the delay by subtracting these values. I have managed to do it, but it takes roughly 12 seconds to go through my list of 17000 packets, and I have 15 lists in total, which adds up to about 3 minutes of execution time with the following code:

data <- read.csv("normal-novpn.csv", sep=",", numerals="no.loss", header=TRUE)
ID = data.matrix(data[,7], rownames.force = NA)   # "Identification" column
time = data.matrix(as.double(as.character(data[,2])), rownames.force = NA)
time = time*1000000 # Time is now in microseconds

# Extract the decimal ID from strings like "0xa5f0 (42480)", row by row
len <- nrow(ID)
mat <- matrix(,nrow=len,ncol=2)
for(i in 1:len){
    d <- unlist(strsplit(ID[i], " "))
    mat[i,1] <- as.numeric(gsub('[()]','',d[2]))  # keep the number in parentheses
    mat[i,2] <- time[i]
}

# Pair each send row with its matching later receive row (nested scan, O(n^2))
delay = vector(length=len/2)
k <- 1
for(i in 1:len){
    for(j in i:len){
        if(mat[i,1] == mat[j,1] && mat[j,2] > mat[i,2]){
            delay[k] <- mat[j,2] - mat[i,2]
            k <- k+1
        }
    }
}
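As an aside, the per-row strsplit/gsub parsing can be done for the whole column in a single vectorized call. A minimal sketch (the example ID strings are taken from the sample data below):

```r
# Vectorized ID parsing: pull out the decimal value inside the
# parentheses of strings like "0x8dab (36267)" in one sub() call.
ids    <- c("0x8dab (36267)", "0x8dab (36267)", "0x8dac (36268)")
num_id <- as.numeric(sub(".*\\((\\d+)\\).*", "\\1", ids))
num_id
# [1] 36267 36267 36268
```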

The rows in the CSV file are ordered with respect to time, and a row looks like this:

"32","1505997726.015245358","10.0.10.70","10.0.10.1","UDP","214","0xa5f0 (42480)","50414  >  5201 Len=172"

where the timestamp is: "1505997726.015245358" and the ID is: "0xa5f0 (42480)"

My question is whether I can do this more efficiently to reduce the execution time.

Update: This is a link to one of my CSV files containing the 17000 lines: https://justpaste.it/1bjoy

Here is a smaller file with only 10 lines of data plus the header. One thing to mention: it is not true for all files that the duplicate IDs are adjacent in the list.

"No.","Time","Source","Destination","Protocol","Length","Identification","Info"
"120","1505984967.366049706","10.0.0.50","10.0.0.35","UDP","214","0x8dab (36267)","46670  >  5201 Len=172"
"123","1505984967.366440","10.0.0.50","10.0.0.35","UDP","214","0x8dab (36267)","46670  >  5201 Len=172"
"124","1505984967.386478504","10.0.0.50","10.0.0.35","UDP","214","0x8dac (36268)","46670  >  5201 Len=172"
"125","1505984967.386606","10.0.0.50","10.0.0.35","UDP","214","0x8dac (36268)","46670  >  5201 Len=172"
"130","1505984967.406353133","10.0.0.50","10.0.0.35","UDP","214","0x8db0 (36272)","46670  >  5201 Len=172"
"131","1505984967.406555","10.0.0.50","10.0.0.35","UDP","214","0x8db0 (36272)","46670  >  5201 Len=172"
"132","1505984967.426372842","10.0.0.50","10.0.0.35","UDP","214","0x8db1 (36273)","46670  >  5201 Len=172"
"133","1505984967.426558","10.0.0.50","10.0.0.35","UDP","214","0x8db1 (36273)","46670  >  5201 Len=172"
"134","1505984967.446282356","10.0.0.50","10.0.0.35","UDP","214","0x8db6 (36278)","46670  >  5201 Len=172"
"135","1505984967.446555","10.0.0.50","10.0.0.35","UDP","214","0x8db6 (36278)","46670  >  5201 Len=172"

Update 2: The order of the rows must be preserved, as I will perform additional calculations on the new values. The first column, "No.", indicates the packet number as counted by Wireshark and must increase while traversing down the list.

Jimmy
  • please provide a reproducible example – moodymudskipper Sep 22 '17 at 08:43
  • You almost certainly can do this more efficiently, either using a self-join on ID, or with some clever use of `lead()`/`lag()` over the IDs to get the send/receive stamps and then calculate the difference. If you post a small example set I can provide some example code. – Jim Leach Sep 22 '17 at 08:56
  • @JimLeach : I have added a small sample of my data as well as a link to one of my data files that takes roughly 12 seconds to go through. – Jimmy Sep 22 '17 at 09:06

1 Answer


Here is a fast solution using data.table. The file so_long.csv is the one from your edit.

library(data.table)
library(microbenchmark)

foo <- function() {
  dt <- fread("so_long.csv")
  dt[, Time := as.double(as.character(Time)) * 1000000]
  dt[, .(Delay = max(Time) - min(Time)), by = Identification]
}

head(foo())
# Identification   Delay
# 1:     0x0003 (3) 1749.75
# 2:     0x0004 (4) 1761.00
# 3:     0x0007 (7) 1887.50
# 4:     0x0009 (9) 1983.75
# 5:    0x000e (14) 1929.75
# 6:    0x0014 (20) 1948.50

microbenchmark(foo())
# Unit: milliseconds
# expr      min       lq     mean   median       uq      max neval
# foo() 38.28835 52.17356 64.48024 60.63322 72.21627 132.8679   100
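For what it's worth, the pairing can also be sketched in base R in linear time, without the nested loops. This is only an illustration with toy vectors, not the answer's method, and it assumes each ID occurs exactly twice with the send row before the receive row:

```r
# Toy data: two rows per ID, receive row appearing after the send row.
id   <- c("a", "a", "b", "b")
time <- c(1.0, 3.5, 2.0, 2.25)

recv  <- duplicated(id)                          # TRUE on the second (receive) row of each ID
delay <- time[recv] - time[match(id[recv], id)]  # receive time minus first (send) time
delay
# [1] 2.50 0.25
```

Because `duplicated()` and `match()` are vectorized, this preserves the original row order of the receive packets while avoiding the O(n^2) scan.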
bendae
  • Thank you, but does this method change the order of the data? With your call to head(foo()) the values are ordered after identification which I can not do in my calculations. The original order according to the column "No." must remain. – Jimmy Sep 22 '17 at 14:02
  • Edited to not change the order (`setkey` wasn't needed in the first place) – bendae Sep 22 '17 at 14:15