This question is an extension of How can I sum rows that with non-numeric factor in R?. I have a data frame in data.txt that looks like this:

        Latency     Port        TrafficType     Time
    1   27821       Port1       ssh     "2016/02/05 15:18:25"
    2   24186       Port1       http    "2016/02/05 15:18:25"
    3   17963       Port1       ssh     "2016/02/05 15:18:25"
    4   20208       Port1       ftp     "2016/02/05 15:18:25"
    5   20703       Port2       ftp     "2016/02/05 15:18:25"
    6   29735       Port3       ssh     "2016/02/05 15:18:25"
    7   20975       Port1       https   "2016/02/05 15:18:25"
    8   29489       Port1       ssh     "2016/02/05 15:18:25"
    9   19319       Port4       ssh     "2016/02/05 15:18:25"
    10  18224       Port1       ssh     "2016/02/05 15:18:25"
    11  17952       Port1       ftp     "2016/02/05 15:18:25"
    12  17972       Port1       ssh     "2016/02/05 15:18:25"
    13  17300       Port1       ssh     "2016/02/05 15:18:25"
    14  20937       Port1       ssh     "2016/02/05 15:18:25"
    15  18769       Port1       ssh     "2016/02/05 15:18:25"
    16  18104       Port2       ssh     "2016/02/05 15:18:25"
    17  17496       Port2       ssh     "2016/02/05 15:18:26"
    18  23268       Port1       https   "2016/02/05 15:18:26"
    19  19457       Port1       ssh     "2016/02/05 15:18:26"
    20  20937       Port1       ssh     "2016/02/05 15:18:25"
    21  18769       Port1       ssh     "2016/02/05 15:18:25"
    22  18104       Port2       ssh     "2016/02/05 15:18:25"
    23  17496       Port2       ssh     "2016/02/05 15:18:26"
    24  23268       Port1       https   "2016/02/05 15:18:26"
    25  19457       Port1       ssh     "2016/02/05 15:18:27"
    ....

I used tapply() to do some statistics:

data <- read.table("data.txt")
fact <- factor(data$Port)
lat <- tapply(data$Latency, fact,
           function(x) {
               c(max(x),
                 mean(x),
                 median(x),
                 quantile(x, c(0.90,0.99,0.9999)))
           })

Then I got:

    $Port1
                                    90%      99%   99.99% 
    29489.00 20941.78 19832.50 25276.50 29205.44 29486.16 

    $Port2
                                    90%      99%   99.99% 
    20703.00 18380.60 18104.00 19663.40 20599.04 20701.96 

    $Port3
                           90%    99% 99.99% 
     29735  29735  29735 29735  29735  29735 

    $Port4
                           90%    99% 99.99% 
     19319  19319  19319 19319  19319  19319

I wanted to append more statistics to the table above, like this:

    $Port1
                                   90%      99%   99.99% ftp http https ssh peak
    29489.00 20941.78 19832.50 25276.50 29205.44 29486.16 2   1   3     12   14

    $Port2
                                    90%      99%   99.99% ftp http https ssh peak
    20703.00 18380.60 18104.00 19663.40 20599.04 20701.96 1    0     0    4    3

    $Port3
                           90%    99% 99.99% ftp http https ssh peak
     29735  29735  29735 29735  29735  29735 ?   ?    ?     ?   ?

    $Port4
                           90%    99% 99.99% ftp http https ssh peak
     19319  19319  19319 19319  19319  19319 ?   ?    ?     ?   ?

Yesterday I asked How can I sum rows that with non-numeric factor in R?, and @akrun showed me how to apply table() to a subset of the data to count each traffic type per port:

     t <- table(data[c("Port", "TrafficType")])
     t
                    TrafficType
     Port    ftp http https ssh
      Port1   2    1     3  12
      Port2   1    0     0   4
      Port3   0    0     0   1
      Port4   0    0     0   1

Now, my questions are:

  1. How can I append this result to the table (after the 99.99% column)?

  2. How can I compute the peak flow rate (flows/second) for each port? For example, Port1 has 14 flows at 2016/02/05 15:18:25, 3 flows at 2016/02/05 15:18:26 and 1 at 2016/02/05 15:18:27, so its peak is 14, and that is the number I need in the peak column.

Hopefully I have described my question clearly enough. Thanks a lot for your patience and kind responses.

Updated: I found an ugly approach, which is to compute the message rate separately:

    rate_df <- as.data.frame(table(data[c("Port", "Time")]))  # long format: Port, Time, Freq
    rate_fc <- factor(rate_df$Port)
    peak <- tapply(rate_df$Freq, rate_fc, max)  # highest flows-per-second count for each port

I then use print() to append the peak values after the latency statistics. It looks ugly, so I would appreciate expert advice here. Thanks a lot.

Luke Huang
  • Modify your anonymous function call. – alexwhitworth Feb 10 '16 at 17:43
  • @Alex, totally no idea how to, just started learning R for a couple weeks. – Luke Huang Feb 15 '16 at 21:21
  • @LukeHuang The anonymous function call Alex is referring to is the call to `function` in your second block of code. You can read more about anonymous functions here: http://adv-r.had.co.nz/Functional-programming.html. What Alex is suggesting is to add those statistics to your rows as you create them. – Empiromancer Feb 15 '16 at 21:31
  • Thanks @user164385, I see your point and will try. – Luke Huang Feb 15 '16 at 21:38
  • To be honest, it's not the approach I'd actually recommend. It's a quick and dirty fix that will do what you want, but (given how your data looks and what you're trying to do with it) I think you'll have an easier time in the long run if you put your summary data in a data frame rather than a table or some other data structure. I'll write an answer elaborating. – Empiromancer Feb 15 '16 at 21:54
  • @LukeHuang SO is not a programming service. It is a Q/A site for programmers to work as a community. It is taken as a given that users will make an effort to **learn** programming. Based on your comments, I'm not convinced you are trying. Take the time to learn what an anonymous function is and how to use and modify them. – alexwhitworth Feb 15 '16 at 22:08

1 Answer

If you just want to hack together something that works right now, @Alex's comment about modifying your anonymous function call in the second code block of your question will do the job for you. However, in the interest of helping you out more long term, I'd instead recommend turning that table of yours into a data frame. It's practically crying out to be one anyway.

It's quite easy to add new columns to a data frame d; just use d$new_column_name <- vector_of_values or d[,"new_column_name"] <- vector_of_values.
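
As a rough sketch of what that could look like (not the only way to do it): the snippet below uses a handful of rows copied from the question's data.txt, stacks the per-port tapply() results into a data frame, and then adds one column by assignment. The names summary_df and n_flows are made up for illustration and do not appear in the question.

```r
# A few rows copied from the question's data.txt (a subset, for illustration only)
data <- data.frame(
  Latency = c(27821, 24186, 20703, 29735, 19319, 17496),
  Port    = c("Port1", "Port1", "Port2", "Port3", "Port4", "Port2"),
  stringsAsFactors = FALSE
)

# Same statistics as in the question, with names so the columns get labels
lat <- tapply(data$Latency, data$Port, function(x) {
  c(max = max(x), mean = mean(x), median = median(x),
    quantile(x, c(0.90, 0.99, 0.9999)))
})

# Stack the per-port vectors into one data frame, one row per port
summary_df <- as.data.frame(do.call(rbind, lat))

# Adding a column is a plain assignment; n_flows is a hypothetical column name
summary_df$n_flows <- as.vector(table(data$Port)[rownames(summary_df)])

print(summary_df)
```

Once the summary lives in a data frame, every further statistic is just another column, which is the main reason I'd prefer it over a list of vectors.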

You can also turn the table t that @akrun taught you how to make into a data frame and glue the two together: as long as two data frames a and b have the same number of rows, cbind(a, b) will produce a data frame with the columns of both a and b. Note that as.data.frame(t) returns the table in long format (one row per Port/TrafficType combination, with a Freq column), so use as.data.frame.matrix(t) to keep one row per port. (As a side note, it's a good idea not to use t as the name of an object, for clarity and readability of code, since t is also the name of the transpose function.)
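
Here is a minimal sketch of the glue step, again on a few rows copied from the question. It assumes that "peak" means the largest number of rows sharing one Port and one Time value. The names counts, counts_df, per_second and result are hypothetical.

```r
# A few rows copied from the question's data.txt (a subset, for illustration only)
data <- data.frame(
  Port        = c("Port1", "Port1", "Port2", "Port3", "Port4", "Port2"),
  TrafficType = c("ssh", "http", "ftp", "ssh", "ssh", "ssh"),
  Time        = c(rep("2016/02/05 15:18:25", 5), "2016/02/05 15:18:26"),
  stringsAsFactors = FALSE
)

# Traffic-type counts per port (named counts rather than t, to avoid masking t())
counts <- table(data[c("Port", "TrafficType")])

# Wide layout: one row per port, one column per traffic type
counts_df <- as.data.frame.matrix(counts)

# Flows per port per second, then the largest count in each port's row
per_second <- table(data$Port, data$Time)
peak <- apply(per_second, 1, max)

# Glue the counts and the peak together, matching rows by port name
result <- cbind(counts_df, peak = peak[rownames(counts_df)])

print(result)
```

The same cbind() call can take the latency summary as a further argument, provided its rows are in the same port order.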

Empiromancer
  • My personal opinion is that this doesn't qualify as an answer, but should be moved to the comment section above... that said, it's just an opinion. – alexwhitworth Feb 15 '16 at 22:10
  • @Alex I can see how it's a bit of an edge case. I figure that it's better as an answer than a comment because it includes specific suggestions about code OP could use to solve their problem, though since most of the guidelines on answer vs. comment are based on tautologies ("it's an answer if it answers the question") I'm also going with my gut somewhat. – Empiromancer Feb 15 '16 at 22:22