0

I have a R data frame with 3 columns

  1. timestamp
  2. Category
  3. Value

I am trying to find an elegant way(ideally) to find where the values increased, or decreased, by X percent within a specified time frame. For example, I'd like to know all points in the data where Value increased by 50% or more within 1 week.

Are there any built in funcitons of packages where I can just pass a percentage and a number of days and have it return which rows in the data frame are a match?

something along these lines(pseudo code below):

RowsThatareAMatch <- findmatches(date=MyDF$Timestamp, grouping=MyDF$Category, data=MyDF$Value, growth=0.5, range=7)

The thing that is throwing me off is that I want it returning the rows for each Category that has values, and not just look at every value in the data frame. So if Category A & B had growth of 50% or more within 7 days 8 times in my data, I want those rows returned, and if categories C, D, & E didn't every have that kind of increase I don't want data from those categories returned at all.

Right now I am looking at systematically splitting the data frame into multiple data frames for each category and then doing the analysis on each individual data frame. While that approach could work, something is telling me that R has an easier way to do this.

Thoughts?

edit: Ideally what I am looking for returned is a data frame with 3 columns, and 1 row for every match in my data.

  1. Category
  2. Start timestamp of the match
  3. End timestamp of the match.

Based in my experience with R I would need to identify the row numbers for each grouping and then I could extract the above data from the original data frame, but if there's any good way to go straight to the above output that would be awesome too!

Sample Data

So I have a CSV like this:

Timestamp,Category,Value
2015-01-01,A,1
2015-01-02,A,1.2
2015-01-03,A,1.3
2015-01-04,A,8
2015-01-05,A,8.2
2015-01-06,A,9
2015-01-07,A,9.2
2015-01-08,A,10
2015-01-09,A,11
2015-01-01,B,12
2015-01-02,B,12.75
2015-01-03,B,15
2015-01-04,B,60
2015-01-05,B,62.1
2015-01-06,B,63
2015-01-07,B,12.3
2015-01-08,B,10
2015-01-09,B,11
2015-01-01,C,100
2015-01-02,C,100000
2015-01-03,C,200
2015-01-04,C,350
2015-01-05,C,780
2015-01-06,C,780.2
2015-01-07,C,790
2015-01-08,C,790.3
2015-01-09,C,791
2015-01-01,D,0.5
2015-01-02,D,0.8
2015-01-03,D,0.83
2015-01-04,D,2
2015-01-05,D,0.01
2015-01-06,D,0.03
2015-01-07,D,0.99
2015-01-08,D,1.23
2015-01-09,D,5

I would read that into R like this

df <- read.csv("CategoryMeasurements.csv", header=TRUE)
user3246693
  • 679
  • 11
  • 22
  • Yes, you can do this. No there is not a built-in function... You are correct that a split-apply-combine approach is the right way to go. If you give a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) contributors can provide answers. At this point it's all hypothetical. – alexwhitworth Dec 04 '15 at 16:30

2 Answers2

0

Say you your data.frame is called df, you can do something like this using data.table which creates a new row that reads "increase over 50%" if the value grew by 50% or more (which you can then filter):

lag <- function(x, n) c(rep(NA, n), x[1:(length(x) - n)])

library(data.table)
setDT(df)[, ifelse(value/lag(value, 1) - 1 > 0.5, "increase over 50%", "Other"), by = category]
David
  • 9,216
  • 4
  • 45
  • 78
  • This works for comparing each row to the previous row based on category, but I need to know if the increase over X days is more than 50%, and where the start/end dates of that increase. – user3246693 Dec 04 '15 at 17:01
0

Well I'm not sure how elegant this is, but it works, and I wound up having to subset by category before passing the data frame to my function and will need to create a loop or use one of the apply functions to pass each category to my function, but it should get the job done.

Mydf <- read.csv("CategoryMeasurements.csv", header=TRUE)

GetIncreasesWithinRange <- function(df, growth, days ) {
  # df     = data frame with data you want processed.  1st column should be a date, 2nd column should be the data.
  # growth = % of growth you are looking for in the data
  # days   = the number of days that the growth should occur in to be a match.

  df <- df[order(df[,1]), ] # Sort the df by the date column.  This is important for the loop logic.

  # Initialize empty data frame to hold results that will be returned from this funciton.
  ReturnDF <- data.frame( StartDate=as.Date(character()),
                      EndDate=as.Date(character()),
                      Growth=double(), 
                      stringsAsFactors=FALSE)

  TotalRows = nrow(df)
  for(i in 1:TotalRows) {
    StartDate  <- toString(df[i,1])
    StartValue <- df[i,2]
    for(x in i:(TotalRows)) {
      NextDate  <- toString(df[x,1])
      DayDiff <- as.numeric(difftime(NextDate ,StartDate , units = c("days")))
      if(DayDiff >= days) { 
        NextValue <- df[x,2]
        PercentChange = (NextValue - StartValue)/NextValue
        if(PercentChange >= growth) {
          ReturnDF[(nrow(ReturnDF)+1),] <- list(StartDate, NextDate, PercentChange)
        }
        break
      }
    }
  }
  return(ReturnDF)
}

subDF <- Mydf[which(Mydf$Category=='A'), ]
subDF$Category <- NULL # Nuke the category column from the subsetting DF.  It's not relevant for this.

X <- GetIncreasesWithinRange(subDF, 0.5, 4)
print(X)

Which outputs

   StartDate    EndDate    Growth
1 2015-01-01 2015-01-05 0.8780488
2 2015-01-02 2015-01-06 0.8666667
3 2015-01-03 2015-01-07 0.8586957
user3246693
  • 679
  • 11
  • 22