
I have network traffic data in an R dataset, in the following form, for each hour of a ten-day period:

   Day   Hour         Volume          Category
    0    00            100            P2P
    0    00            50             email
    0    00            200            gaming
    0    00            200            video
    0    00            150            web
    0    00            120            P2P
    0    00            180            web
    0    00            80             email
    ....
    0    01            150            P2P
    0    01            200            P2P
    0    01             50            Web
    ...
    ...
    10   23            100            web
    10   23            200            email
    10   23            300            gaming
    10   23            300            gaming

As seen, a Category can repeat within a single hour. I need to calculate the volatility and the peak-hour-to-average-hour ratio for these different application categories.

Volatility: the standard deviation of the hourly volumes divided by the mean hourly volume.

Peak-hour-to-average-hour ratio: the ratio of the volume of the maximum hour to the volume of the average hour for that application.
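In other words, for a single category with its hourly volumes in a vector v, these are the two numbers I am after (v here is hypothetical and shortened, just to illustrate):

v <- c(100, 150, 200, 120)      # hypothetical hourly volumes for one category
volatility <- sd(v) / mean(v)   # coefficient of variation
pa_ratio   <- max(v) / mean(v)  # peak hour vs. average hour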

How do I aggregate and calculate these two statistics for each category? I am new to R and don't know much about how to aggregate and compute averages like this.

So the final result would look something like this, where the volume for each category is first aggregated onto a single 24-hour period by summing, and then the two statistics are calculated:

Category    Volatility      Peak to Avg. Ratio
Web            0.55            1.5
P2P            0.30            2.1
email          0.6             1.7
gaming         0.4             2.9

Edit: plyr got me as far as this:

library(plyr)

stats <- ddply(
    .data = my_data,
    .variables = .(Hour, Category),   # groups by Hour *and* Category
    .fun = function(x) {
        data.frame(
            volatility = sd(x$Volume) / mean(x$Volume),  # sd of hourly volumes / mean
            pa_ratio   = max(x$Volume) / mean(x$Volume)  # peak hour / average hour
        )
    }
)

But this is not quite what I was hoping for. I want the statistics per Category, where all the hours across the days are first aggregated into a single 24-hour period by summing the volumes, and then the volatility and PA ratio are calculated. Any suggestions for improvement?

sfactor
  • I suggest you have a look at the plyr library http://had.co.nz/plyr/ - this was designed to solve exactly this type of problem. There are many examples of plyr solutions on stackoverflow: http://stackoverflow.com/questions/tagged/plyr – Andrie Feb 20 '11 at 20:48
  • @Andrie I've added the code using plyr but I got the aggregation part wrong. Please look at the edited question. – sfactor Feb 20 '11 at 21:11
  • Thank you for posting some code. That makes it a lot easier to respond. I have posted my solution below. PS. You can make life even easier if you also paste some code to create dummy data, since this means every person who posts an answer can work off the same example. – Andrie Feb 20 '11 at 21:44

1 Answer

You'd need to do it in two stages (using the plyr package). First, as you pointed out, there can be multiple Day-Hour combinations for the same Category, so we first aggregate each Category's totals within each Hour, regardless of the day:

df1 <- ddply( df, .(Hour, Category), summarise, Volume = sum(Volume))
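For reference, base R's aggregate computes the same hourly totals (this assumes df has columns Hour, Category and Volume, as in the question):

df1 <- aggregate(Volume ~ Hour + Category, data = df, FUN = sum)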

Then you get your stats:

> ddply(df1, .(Category), summarise,
+            Volatility = sd(Volume)/mean(Volume),
+            PeakToAvg = max(Volume)/mean(Volume) )

  Category Volatility PeakToAvg
1      P2P  0.3225399  1.228070
2      Web         NA  1.000000
3    email  0.2999847  1.212121
4   gaming  0.7071068  1.500000
5    video         NA  1.000000
6      web  0.7564398  1.534884
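(The NA values appear because Web and video occur in only one hour of this small sample, and sd() of a single value is NA in R.)

As suggested in the comments, a small self-contained dummy dataset makes the whole pipeline reproducible. A minimal sketch with made-up numbers (so the statistics will not match the output above):

library(plyr)

set.seed(42)
df <- data.frame(
    Day      = rep(0:10, each = 24 * 4),                         # days 0..10
    Hour     = rep(rep(0:23, each = 4), times = 11),             # 4 rows per hour
    Category = rep(c("P2P", "email", "gaming", "web"), 11 * 24),
    Volume   = sample(50:300, 11 * 24 * 4, replace = TRUE)       # random volumes
)

df1 <- ddply(df, .(Hour, Category), summarise, Volume = sum(Volume))
ddply(df1, .(Category), summarise,
      Volatility = sd(Volume) / mean(Volume),
      PeakToAvg  = max(Volume) / mean(Volume))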
Prasad Chalasani