
In a data frame, I am trying to find data points that are more than (threshold * s.d.) away from the mean. The dimensions of the data frame are:

[1] 4032    4

To find the data points for the above condition, I did:

df$mean = rollapply(df$value, width = 2, FUN = mean, align = "right", fill = "extend")

df$sd = rollapply(df$value, width = 2, FUN = sd,  align = "right", fill = "extend")

After this, head(df) looks like:

            timestamp  value   mean        sd
2007-03-14 1393577520 37.718 38.088 0.5232590
2007-03-15 1393577220 38.458 38.088 0.5232590
2007-03-16 1393576920 37.912 38.185 0.3860803
2007-03-17 1393576620 40.352 39.132 1.7253405
2007-03-18 1393576320 38.474 39.413 1.3279465
2007-03-19 1393576020 39.878 39.176 0.9927779

To find the data points, I used:

anomaly = df[df$value > abs((threshold*df$sd + df$mean) | 
                                (df$mean - threshold*df$sd)),]

Is the above the correct way to find data points that are more than (threshold * s.d.) away from the mean? I am suspicious because the dim of anomaly is the same as that of df.

Jatt
  • Why are you using `rollapply` here? Maybe you just want `anomaly = df[abs(df$value - mean(df$value)) > threshold * sd(df$value),]`. When asking for help, you should include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions – MrFlick Mar 28 '18 at 15:49
  • @MrFlick It returns zero rows. – Jatt Mar 28 '18 at 16:01
  • @MrFlick Updated the `head(df)` – Jatt Mar 28 '18 at 16:04
  • @Jatt it could then be that there are no anomalies in your dataset. Alternatively you could also try something similar to MrFlick's answer: `anomaly = df[abs(df$value - df$mean) > threshold * sd(df$sd),]` – Mike H. Mar 28 '18 at 16:08
  • @MikeH. This gave 3 rows. Could you explain what the statement `data points that are more than (threshold * s.d.) away from mean.` really means? – Jatt Mar 28 '18 at 16:12
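
A minimal sketch of why the filter in the question keeps every row, using a made-up three-row data frame (the values and `threshold` here are illustrative assumptions, not the original data). In R, `|` coerces its numeric operands to logical, so the bracketed expression collapses to `TRUE`; `abs(TRUE)` is 1, and the test becomes `df$value > 1`.

```r
# Toy data; the values and the threshold are illustrative assumptions
df <- data.frame(value = c(37.7, 38.5, 40.4),
                 mean  = c(38.1, 38.1, 39.1),
                 sd    = c(0.52, 0.52, 1.73))
threshold <- 2

# `|` treats any non-zero number as TRUE, so this is a logical vector
bounds <- (threshold * df$sd + df$mean) | (df$mean - threshold * df$sd)

# abs() of TRUE is 1, so the comparison is effectively df$value > 1
keep_original <- df$value > abs(bounds)

# Intended test: distance from the mean exceeds threshold * sd
keep_fixed <- abs(df$value - df$mean) > threshold * df$sd
```

Here `keep_original` is `TRUE` for every row, which matches the symptom of `anomaly` having the same dim as `df`, while `keep_fixed` compares each value's distance from its mean against the scaled standard deviation.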

1 Answer

This will do it:

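# create some dummy data: 4032 x 4 values drawn uniformly from [-1, 1]
m <- matrix(runif(16128, -1, 1), ncol = 4)

# threshold, in units of standard deviations
tresh <- 1.004
# values lying more than tresh * sd away from the mean (either side)
m[which(abs(m - mean(m)) > tresh * sd(m), arr.ind = TRUE)]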

Here m denotes your matrix (or your value column, depending on what you take the mean/sd of) and tresh is your threshold.
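
As a small check of the same test on a data frame column, here is a sketch with made-up numbers (the data frame `toy`, its `id` column, and its values are assumptions for illustration). It keeps whole rows and flags values on both sides of the mean.

```r
# Made-up data: 20 values equal to 10, plus one low and one high outlier
toy <- data.frame(id = 1:22, value = c(rep(10, 20), -40, 60))
tresh <- 2

# TRUE where a value lies more than tresh standard deviations from the mean
is_outlier <- abs(toy$value - mean(toy$value)) > tresh * sd(toy$value)

# Logical indexing keeps the full rows; which() is not required
anomaly <- toy[is_outlier, ]
```

With these numbers the mean is 10 and both extreme rows (ids 21 and 22) end up in `anomaly`, because `abs()` measures distance in either direction.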

Update: here are the first few entries of my result:

dat <- df$value[which(abs(df$value-mean(df$value)) > tresh*sd(df$value))]
head(dat)
[1] 51.846 48.568 44.986 49.108 53.404 46.314
niko
  • what does `arr.T` mean? – Jatt Mar 28 '18 at 16:05
  • `arr.ind = TRUE` means that `which` returns the array index, i.e. it makes `which` say "in row x, col y there is a match" instead of "element number X is a match". (It is not actually needed here, so you can just leave it out.) – niko Mar 28 '18 at 16:11
  • The data I am trying on gives out 0 rows. I am sure there are anomalies in the data. – Jatt Mar 28 '18 at 16:36
  • Here is the csv I am trying : http://www.sharecsv.com/s/af385df6de4421fd6094fc758f2d9a3b/ec2_cpu_utilization_5f5533.csv – Jatt Mar 28 '18 at 16:38
  • Works perfectly fine when I do `dat <- df$value[which(abs(df$value-mean(df$value)) > tresh*sd(df$value))]`. See the edit for my result `dat`. – niko Mar 28 '18 at 16:43
  • Okay. With tresh=4 I get 2 results. Earlier I was trying `p = tresh * sd(df$value); m = mean(df$value); df[df$value > p+m,]`. Is that incorrect compared with what you suggested? – Jatt Mar 28 '18 at 17:01
  • Of course, the result depends on `tresh`. The problem with `df$value > p+m` is that you do not take into account the instances where `m-df$value > p`, i.e. `df$value < m-p` – niko Mar 28 '18 at 17:24
  • Is there a way I could use `rolling mean` and `rolling standard deviation` in place of global mean and sd? – Jatt Mar 29 '18 at 16:17
  • @Jatt I am not familiar with `rolling mean / sd`. If these are functions then you could probably just swap them for `mean / sd` – niko Mar 31 '18 at 21:56
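
On the rolling question, one possible sketch is to compare each point with the mean and sd of the values just before it. The helper `rolling_stats`, the window width of 5, and the series below are all assumptions for illustration, written in base R rather than with `rollapply`. Note that a window of width 2 that includes the current point cannot work for this purpose: the point always sits exactly 1/sqrt(2) ≈ 0.707 standard deviations from the window mean, so no threshold of 1 or more is ever exceeded.

```r
# Hypothetical helper: mean and sd of the `width` values before each point
rolling_stats <- function(x, width) {
  n <- length(x)
  m <- rep(NA_real_, n)
  s <- rep(NA_real_, n)
  if (n > width) {
    for (i in (width + 1):n) {
      win <- x[(i - width):(i - 1)]  # excludes the current point
      m[i] <- mean(win)
      s[i] <- sd(win)
    }
  }
  data.frame(mean = m, sd = s)
}

# Made-up series with one spike at position 8
value <- c(10, 11, 10, 9, 10, 11, 10, 30, 10, 9)
tresh <- 3
roll <- rolling_stats(value, width = 5)

# Comparisons are NA where the window is incomplete; which() drops them
flagged <- which(abs(value - roll$mean) > tresh * roll$sd)
```

Leaving the current point out of its own window keeps a spike from inflating the statistics it is compared against; the points right after a spike are still judged against a window that contains it, so the width is a trade-off to tune.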