
I have a combined packet capture from both sides of a network connection in Wireshark. The capture is exported as a CSV file, and every row contains, among other things, a unique ID and a timestamp. Because I capture from both sides, I get two rows for every ID: one with the send timestamp and one with the receive timestamp. What I want to do is calculate the delay by subtracting these values. I have managed to do it, but it takes roughly 12 seconds to go through my list of 17000 packets, and I have 15 lists in total, which adds up to about 3 minutes of execution time with the following code:

data <- read.csv("normal-novpn.csv", sep=",", numerals="no.loss", header=TRUE)
ID = data.matrix(data[,7], rownames.force = NA)   # "Identification" column
time = data.matrix(as.double(as.character(data[,2])), rownames.force = NA)
time = time*1000000 # Time is now in microseconds

# Extract the decimal ID from strings like "0xa5f0 (42480)", row by row
len <- nrow(ID)
mat <- matrix(,nrow=len,ncol=2)
for(i in 1:len){
    d <- unlist(strsplit(ID[i], " "))
    mat[i,1] <- as.numeric(gsub('[()]','',d[2]))  # keep the number in parentheses
    mat[i,2] <- time[i]
}

# Pair each send row with its matching later receive row (nested scan, O(n^2))
delay = vector(length=len/2)
k <- 1
for(i in 1:len){
    for(j in i:len){
        if(mat[i,1] == mat[j,1] && mat[j,2] > mat[i,2]){
            delay[k] <- mat[j,2] - mat[i,2]
            k <- k+1
        }
    }
}
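As an aside, the per-row strsplit/gsub parsing can be done for the whole column in a single vectorized call. A minimal sketch (the example ID strings are taken from the sample data below):

```r
# Vectorized ID parsing: pull out the decimal value inside the
# parentheses of strings like "0x8dab (36267)" in one sub() call.
ids    <- c("0x8dab (36267)", "0x8dab (36267)", "0x8dac (36268)")
num_id <- as.numeric(sub(".*\\((\\d+)\\).*", "\\1", ids))
num_id
# [1] 36267 36267 36268
```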

The rows in the CSV file are ordered with respect to time, and a row looks like this:

"32","1505997726.015245358","10.0.10.70","10.0.10.1","UDP","214","0xa5f0 (42480)","50414  >  5201 Len=172"

where the timestamp is: "1505997726.015245358" and the ID is: "0xa5f0 (42480)"

My question is whether I can do this more efficiently to reduce the execution time.

Update: This is a link to one of my CSV files containing the 17000 lines: https://justpaste.it/1bjoy

Here is a smaller file with only 10 lines of data plus the header. One thing to mention: it is not true for all files that the duplicate IDs are adjacent in the list.

"No.","Time","Source","Destination","Protocol","Length","Identification","Info"
"120","1505984967.366049706","10.0.0.50","10.0.0.35","UDP","214","0x8dab (36267)","46670  >  5201 Len=172"
"123","1505984967.366440","10.0.0.50","10.0.0.35","UDP","214","0x8dab (36267)","46670  >  5201 Len=172"
"124","1505984967.386478504","10.0.0.50","10.0.0.35","UDP","214","0x8dac (36268)","46670  >  5201 Len=172"
"125","1505984967.386606","10.0.0.50","10.0.0.35","UDP","214","0x8dac (36268)","46670  >  5201 Len=172"
"130","1505984967.406353133","10.0.0.50","10.0.0.35","UDP","214","0x8db0 (36272)","46670  >  5201 Len=172"
"131","1505984967.406555","10.0.0.50","10.0.0.35","UDP","214","0x8db0 (36272)","46670  >  5201 Len=172"
"132","1505984967.426372842","10.0.0.50","10.0.0.35","UDP","214","0x8db1 (36273)","46670  >  5201 Len=172"
"133","1505984967.426558","10.0.0.50","10.0.0.35","UDP","214","0x8db1 (36273)","46670  >  5201 Len=172"
"134","1505984967.446282356","10.0.0.50","10.0.0.35","UDP","214","0x8db6 (36278)","46670  >  5201 Len=172"
"135","1505984967.446555","10.0.0.50","10.0.0.35","UDP","214","0x8db6 (36278)","46670  >  5201 Len=172"

Update 2: The order of the rows must be preserved, as I will perform additional calculations on the new values. The first column, "No.", indicates the packet number as counted by Wireshark and must increase while traversing down the list.

Jimmy
  • please provide a reproducible example – moodymudskipper Sep 22 '17 at 08:43
  • You almost certainly can do this more efficiently, either using a self-join on ID, or with some clever use of `lead()`/`lag()` over the IDs to get the send/receive stamps and then calculate the difference. If you post a small example set I can provide some example code. – Jim Leach Sep 22 '17 at 08:56
  • @JimLeach : I have added a small sample of my data as well as a link to one of my data files that takes roughly 12 seconds to go through. – Jimmy Sep 22 '17 at 09:06

1 Answer


Here is a fast solution using data.table. The file so_long.csv is the one from your edit.

library(data.table)
library(microbenchmark)

foo <- function() {
  dt <- fread("so_long.csv")
  dt[, Time := as.double(as.character(Time)) * 1000000]
  dt[, .(Delay = max(Time) - min(Time)), by = Identification]
}

head(foo())
# Identification   Delay
# 1:     0x0003 (3) 1749.75
# 2:     0x0004 (4) 1761.00
# 3:     0x0007 (7) 1887.50
# 4:     0x0009 (9) 1983.75
# 5:    0x000e (14) 1929.75
# 6:    0x0014 (20) 1948.50

microbenchmark(foo())
# Unit: milliseconds
# expr      min       lq     mean   median       uq      max neval
# foo() 38.28835 52.17356 64.48024 60.63322 72.21627 132.8679   100
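For what it's worth, the pairing can also be sketched in base R in linear time, without the nested loops. This is only an illustration with toy vectors, not the answer's method, and it assumes each ID occurs exactly twice with the send row before the receive row:

```r
# Toy data: two rows per ID, receive row appearing after the send row.
id   <- c("a", "a", "b", "b")
time <- c(1.0, 3.5, 2.0, 2.25)

recv  <- duplicated(id)                          # TRUE on the second (receive) row of each ID
delay <- time[recv] - time[match(id[recv], id)]  # receive time minus first (send) time
delay
# [1] 2.50 0.25
```

Because `duplicated()` and `match()` are vectorized, this preserves the original row order of the receive packets while avoiding the O(n^2) scan.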
bendae
  • Thank you, but does this method change the order of the data? With your call to head(foo()) the values are ordered after identification which I can not do in my calculations. The original order according to the column "No." must remain. – Jimmy Sep 22 '17 at 14:02
  • Edited to not change the order (`setkey` wasn't needed in the first place) – bendae Sep 22 '17 at 14:15