I'm new to R, and my problem is that I know what I need to do, just not how to do it in R. I have a very large data frame from a web services load test, ~20M observations. It has the following variables:

epochtime, uri, cache (hit or miss) 

I'm thinking I need to do a couple of things. I need to subset my data frame to the top 50 distinct URIs, then for each observation in each subset calculate the % cache hit at that point in time. The end goal is a plot of cache hit/miss % over time by URI.

I have read, and am still reading, various posts here on this topic, but I'm pretty new to R and I have a deadline. I'd appreciate any help I can get.

EDIT:

I can't provide exact data, but it looks like this. It's at least 20M observations that I'm retrieving from a Mongo database. Time is epoch and we're recording many thousands per second, so time has a lot of duplicates; that's expected. There could be more than 50 URIs; I only care about the top 50. The end result would be a line plot over time of the % of TCP_HIT out of total occurrences, by URI. Hope that's clearer.

time                uri                 action
1355683900          /some/uri           TCP_HIT
1355683900          /some/other/uri     TCP_HIT 
1355683905          /some/other/uri     TCP_MISS
1355683906          /some/uri           TCP_MISS
rjb101

3 Answers

You are looking for the aggregate function.

Call your data frame u:

> u
        time             uri   action
1 1355683900       /some/uri  TCP_HIT
2 1355683900 /some/other/uri  TCP_HIT
3 1355683905 /some/other/uri TCP_MISS
4 1355683906       /some/uri TCP_MISS

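Here is the ratio of hits for a subset, in ten-second intervals. It relies on action being a factor with exactly two levels, coded in the default alphabetical order (TCP_HIT=1, TCP_MISS=2), so that 2-as.numeric(x) is 1 for a hit and 0 for a miss. With any additional levels the result is no longer a ratio: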

ratio <- function(u) aggregate(u$action ~ u$time %/% 10,
         FUN=function(x) sum((2-as.numeric(x))/length(x)))

Now use lapply to get the final result:

lapply(seq_along(levels(u$uri)),
    function(l) list(uri=levels(u$uri)[l],
     hits=ratio(u[as.numeric(u$uri) == l,])))


[[1]]
[[1]]$uri
[1] "/some/other/uri"

[[1]]$hits
  u$time%/%10 u$action
1   135568390      0.5


[[2]]
[[2]]$uri
[1] "/some/uri"

[[2]]$hits
  u$time%/%10 u$action
1   135568390      0.5

Or otherwise filter the data frame by URI before computing the ratio.
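A minimal sketch of that filtering step, assuming the column names from the question. The toy data, the extra "/rare/uri" row, and the names top_n, counts, top_uris and u_top are illustrative only; the question would use top_n = 50.

```r
# Toy data in the question's layout, plus one rarely requested URI.
u <- data.frame(
  time   = c(1355683900, 1355683900, 1355683905, 1355683906, 1355683907),
  uri    = c("/some/uri", "/some/other/uri", "/some/other/uri",
             "/some/uri", "/rare/uri"),
  action = c("TCP_HIT", "TCP_HIT", "TCP_MISS", "TCP_MISS", "TCP_HIT"),
  stringsAsFactors = FALSE
)

top_n <- 2  # illustrative; the question asks for the top 50

# Count requests per URI, most frequent first.
counts <- sort(table(u$uri), decreasing = TRUE)

# Names of the top_n most frequent URIs (fewer if there are not that many).
top_uris <- names(counts)[seq_len(min(top_n, length(counts)))]

# Keep only the rows that belong to those URIs.
u_top <- u[u$uri %in% top_uris, ]
```

Filtering first keeps the per-URI step from running over URIs that will never be plotted.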

Matthew Lundberg
  • Thanks. Now I'll go try to figure out exactly what you did and how to plot it. :) I do see that I have a list to work with now though – rjb101 Dec 16 '12 at 20:33
  • I'm not sure this is working. I ran this against a 12M obs. dataset and instead of a % like you show above I get: `[[925]] [[925]]$uri [1] "env/service/2/method/blah" [[925]]$hits u$time%/%10 u$action 1 135561363 -3 2 135561382 -3 3 135561386 -3 4 135561473 -3 5 135561507 -7` – rjb101 Dec 17 '12 at 17:43
  • Never mind, I ran this on another dataset and I'm getting numbers that look correct. I'm stuck on plotting this, though. R lists seem to behave differently than in any other language. What I want to do is, for every item in the list, plot the nested list. `> h[1] [[1]] [[1]]$uri [1] "/service/0/method" [[1]]$hits u$time%/%10 u$action 1 135561701 0 2 135561707 0 3 135561710 0 4 135561713 0` Any suggestions would be appreciated. – rjb101 Dec 17 '12 at 22:17

@MatthewLundberg's code is the right idea. Specifically, you want something that utilizes the split-apply-combine strategy.
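For reference, a rough base R sketch of split-apply-combine on the sample data from the question. The helper columns bucket and hit, and the names pieces, per_uri and combined, are my own. Comparing against "TCP_HIT" directly is a swapped-in choice so the result does not depend on how action is coded.

```r
# Sample rows from the question.
u <- data.frame(
  time   = c(1355683900, 1355683900, 1355683905, 1355683906),
  uri    = c("/some/uri", "/some/other/uri", "/some/other/uri", "/some/uri"),
  action = c("TCP_HIT", "TCP_HIT", "TCP_MISS", "TCP_MISS"),
  stringsAsFactors = FALSE
)

u$bucket <- u$time %/% 10          # ten-second buckets
u$hit    <- u$action == "TCP_HIT"  # TRUE for a hit, FALSE otherwise

# Split: one data frame per URI.
pieces <- split(u, u$uri)

# Apply: mean of a logical is the fraction of TRUE values, per bucket.
per_uri <- lapply(pieces, function(d) aggregate(hit ~ bucket, data = d, FUN = mean))

# Combine: stack the pieces back together with a uri column.
combined <- do.call(rbind, Map(function(uri, d) cbind(uri = uri, d),
                               names(per_uri), per_uri))
rownames(combined) <- NULL
```

The result is one long data frame (uri, bucket, hit), which tends to be easier to plot than a nested list.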

Given the size of your data, though, I'd take a look at the data.table package.

You can see why visually here--data.table is just faster.
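A minimal data.table sketch of the same computation, assuming the package is installed and the columns are named as in the question. The names dt, top_n, top_uris, res, bucket and hit_ratio are illustrative. I have not benchmarked this on 20M rows.

```r
library(data.table)  # assumes the data.table package is installed

# Sample rows from the question.
dt <- data.table(
  time   = c(1355683900, 1355683900, 1355683905, 1355683906),
  uri    = c("/some/uri", "/some/other/uri", "/some/other/uri", "/some/uri"),
  action = c("TCP_HIT", "TCP_HIT", "TCP_MISS", "TCP_MISS")
)

top_n <- 50  # from the question

# Request count per URI, most frequent first, keeping at most top_n names.
top_uris <- head(dt[, .N, by = uri][order(-N)], top_n)$uri

# Fraction of TCP_HIT per URI and ten-second bucket, for the top URIs only.
res <- dt[uri %in% top_uris,
          .(hit_ratio = mean(action == "TCP_HIT")),
          by = .(uri, bucket = time %/% 10)]
```

Grouping by both uri and bucket in a single call avoids looping over URIs at all.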

Ari B. Friedman

I thought it would be useful to share my solution to the plotting part of the problem.

My R "noobness" may shine here, but this is what I came up with. It makes a basic line plot for each URI. It plots the actual values; I haven't done any conversions.

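for (i in seq_along(h)) {
  # Each element of h is a list with $uri (name) and $hits (data frame)
  name <- h[[i]]$uri
  dftemp <- h[[i]]$hits
  names(dftemp) <- c("time", "cache")
  # One plot per URI: hit ratio against the ten-second time bucket
  plot(dftemp$time, dftemp$cache, type="o")
  title(main=name)
}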
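If one chart with a line per URI is preferred, something like the following might work. It is only a sketch: the hits data frame and its values are made up for illustration, and it assumes the ratios have already been reshaped into long form (one row per URI and ten-second bucket) rather than the nested list h.

```r
# Made-up per-bucket hit ratios in long form (one row per URI and bucket).
hits <- data.frame(
  uri       = rep(c("/some/uri", "/some/other/uri"), each = 3),
  bucket    = rep(c(135568390, 135568391, 135568392), times = 2),
  hit_ratio = c(0.5, 0.7, 0.9, 0.2, 0.4, 0.3),
  stringsAsFactors = FALSE
)

# Turn the ten-second bucket index back into a timestamp.
hits$when <- as.POSIXct(hits$bucket * 10, origin = "1970-01-01", tz = "UTC")

uris <- unique(hits$uri)
cols <- seq_along(uris)  # one colour index per URI

# Empty frame covering the full time range and 0-100 %.
plot(range(hits$when), c(0, 100), type = "n",
     xlab = "time (UTC)", ylab = "% TCP_HIT")

# One line per URI, drawn in time order.
for (i in seq_along(uris)) {
  d <- hits[hits$uri == uris[i], ]
  d <- d[order(d$when), ]
  lines(d$when, 100 * d$hit_ratio, type = "o", col = cols[i])
}
legend("bottomright", legend = uris, col = cols, lty = 1, pch = 1)
```

With 50 URIs the legend gets crowded, so plotting a handful at a time may be more readable.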
rjb101