How to bin rows by mean then and compute dispersion to identify outliers

Question

I have a matrix where I want to split all rows into 20 bins according to row means. I can achieve this as follows:

library(dplyr)
n_bins = 20
data$bin = ntile(rowMeans(data), n_bins)

Now, within each bin, I would like to z-normalize the dispersion measure of all rows within the bin, in order to identify outlier rows. I want to define outliers at having a z-score cutoff of 1.7. I'm not sure if there is an easy way to go about this but I'm currently stuck at this point.

EDIT:

Problem re-stated/clarified: I have a data.frame that is rather large with 12374 rows (genes) and 785 columns (cells). I'd like to group rows according to rowMeans into 20 bins. Within each bin, I'd like to z-normalized the dispersion measure (variance/mean) of all genes within that bin in order to identify outlier genes whose expression values were highly variable even when compared to genes with similar average expression. I would then like to extract out genes which exceed a z-score threshold of 1.7 to identify significantly variable genes from each bin.

> head(temp[,1:5])
              Drop7_0_AAACTAGGGTGG Drop7_0_AAAGGACGTACG Drop7_0_AACACTTGAGCC Drop7_0_AAGGCAACGAAT Drop7_0_AATGATGGGGTA
0610007P14RIK            0.1439444            0.0000000             0.000000            0.8759335            0.0000000
0610009B22RIK            0.0000000            0.6776718             0.000000            0.0000000            0.0000000
0610009O20RIK            0.1439444            0.0000000             0.000000            0.2735741            0.0000000
0610010B08RIK            1.4769893            1.1369215             1.124842            0.8759335            1.9544187
0610010F05RIK            0.7944809            0.0000000             0.000000            0.7016789            0.9144108
0610010K14RIK            0.1439444            0.0000000             1.124842            0.7016789            0.0000000

When I run this code:

library(dplyr)
n_bins = 20
temp = data
temp$rowm = rowMeans(temp)
outscore = temp %>% mutate(bin=ntile(rowm,n_bins)) %>% 
  group_by(bin) %>% mutate(zscore=scale(rowm),outlier=abs(zscore)>1.7)

I get the error: Error: dims [product 619] do not match the length of object [618] which I think refers to the number of bins in the data.

Any suggestions?

Giving a good reproducible example will help us help you http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — Hack-R, Jun 24 '16 at 20:17
Maybe use `boxplot(..., plot = FALSE)`? Or `hist` for breaks, if you like. — alistaire, Jun 24 '16 at 20:21
@Hack-R, I've edited my OP to show a bit of the data I'm working with and the code I have. Any insight would be greatly appreciated. — user2117258, Jun 24 '16 at 23:57
@user2117258 Great, thanks for the extra effort. If the current answers don't answer your question I'll take a stab at it. — Hack-R, Jun 25 '16 at 00:11
@Hack-R, please! Unfortunately I'm still stuck on this problem. I will restate the problem in the OP again for clarification. — user2117258, Jun 25 '16 at 00:16
@Hack-R I think that refers to the number of bins (12374/20=618.7) created. — user2117258, Jun 25 '16 at 00:45
@user2117258 if you see the `str(outscore)` you can tell that zscore is the whole scale object hence the error. if you replace the last `mutate` for my `ave` solution it will work. see option 3 of my answer — Altons, Jun 25 '16 at 08:37

Altons · Answer 1 · 2016-06-25T08:38:59.657

1

I think I understood your question. Assume that Species are you bins already created. You can use scale to calculate your standardized scores.

     data(iris)
      iris %>% select(Species,Sepal.Length) %>%
 group_by(Species) %>% mutate(zscore=scale(Sepal.Length))

you get zscore by bin or this case by Species

Species Sepal.Length      zscore
    (fctr)        (dbl)       (dbl)
1   setosa          5.1  0.26667447
2   setosa          4.9 -0.30071802
3   setosa          4.7 -0.86811050
4   setosa          4.6 -1.15180675
5   setosa          5.0 -0.01702177
6   setosa          5.4  1.11776320
7   setosa          4.6 -1.15180675
8   setosa          5.0 -0.01702177
9   setosa          4.4 -1.71919923
10  setosa          4.9 -0.30071802
..     ...          ...         ...

from there you can create a flag to highlight those rows gt abs(1.7)

OPTION 2:

Transpose all your cols to rows and calculate z-score by group.

  data(iris)
 w <-  iris %>% select(Species,Sepal.Length:Petal.Length) %>%
   gather(features,values,Sepal.Length:Petal.Length) %>% select(-features)
 w$z <- ave(w$values, w$Species, FUN=scale)

Option 3

 library(dplyr)
 n_bins = 20
 temp = data
 temp$rowm = rowMeans(temp)
 outscore = temp %>% mutate(bin=ntile(rowm,n_bins)) 
 outscore$zscore <- ave(outscore$vrowm, outscore$bin, FUN=scale)

Hope it helps

edited Jun 25 '16 at 08:38

answered Jun 24 '16 at 20:20

Altons

1,422
3
12
23

Great, but where is dispersion computed? Also is `scale()` equivalent to computing dispersion across all columns with a row? The way my data is set up is that I have 200 columns and 5001 rows, where the last row is the bin number. – user2117258 Jun 24 '16 at 20:25
well you did not specify this in your original question - if you can elaborate a bit more then we can help you better. you calculate the `rowMeans` using `ntile` as per your question and then use `scale` to get your z-scores using the `group_by` statement – Altons Jun 24 '16 at 20:28
you can also transpose your 200 cols to rows so the `scale` will work without to much effort. will give you an example soon. – Altons Jun 24 '16 at 20:30
Added a 2nd option just in case – Altons Jun 24 '16 at 21:04
Still having issues trying to apply this. I want to z-normalize the dispersion measure of all rows within the bin and define rows (within each bin) with abs(z-scores) > 1.7. Dispersion in this case is defined as variance/mean. – user2117258 Jun 24 '16 at 21:28
I've edited my OP, please have a look! I feel like I'm close but running into issues with the bins. – user2117258 Jun 24 '16 at 23:56
I'm still struggling to understand how you're computing dispersions (mean/variance) for each bin. The way I understand your code is that you're binning row means, then scaling the rowmeans. I ultimately want to bin by row means, then scale the dispersions for each bin and identify outliers. – user2117258 Jun 28 '16 at 17:04

Max Candocia · Answer 2 · 2016-06-24T21:46:18.247

1

Similar to Alton's answer:

library(dplyr)

n_bins = 20
#making sample data
data = as.data.frame(rbind(replicate(100,rnorm(1000))))

data$rowm = rowMeans(data)

outscore = data %>% mutate(bin=ntile(rowm,n_bins)) %>% 
  group_by(bin) %>% mutate(zscore=scale(rowm),outlier=abs(zscore)>1.7)

scale normalizes the distribution of the row means so that the overall standard deviation is 1. "Outliers", in this case with a z-score with magnitude greater than 1.7, are marked in the outlier column.

If you want to look at which rows have abnormal variance, you can do something like this:

outscore$varscore = apply(outscore[,grepl("^V[0-9]+",names(outscore))],1,var)

outscore = outscore %>% mutate(zscore_var = scale(varscore),
  var_outlier = abs(zscore_var) > 1.7)

If you want to use the row mean bins, you can use that grouping, too:

outscore$varscore_grouped = outscore %>% group_by(bin) %>% 
  select(.,starts_with('V')) %>% apply(1,var)

outscore = outscore %>% mutate(zscore_var_grouped = scale(varscore_grouped), 
  var_group_outlier = abs(zscore_var_grouped) > 1.7)

edited Jun 24 '16 at 21:46

answered Jun 24 '16 at 20:30

Max Candocia

4,294
35
58

Thank you for your response. I want to z-normalize the dispersion measure of all rows within the bin. Dispersion in this case is defined as variance/mean. – user2117258 Jun 24 '16 at 21:04
I added a couple more examples on how to use measure outliers in terms of variance instead of mean. – Max Candocia Jun 24 '16 at 21:46
I cannot seem to get my row names to appear from outscore. I run into the error: `Error: dims [product 619] do not match the length of object [618]` – user2117258 Jun 24 '16 at 21:51
Did you use `rownames(outscore)`? – Max Candocia Jun 24 '16 at 22:09
I'm not able to make the outscore data frame because of the row names in my initial `data` data frame. Im updating my OP. – user2117258 Jun 24 '16 at 23:14
I would greatly appreciate any help you could toss my way! – user2117258 Jun 24 '16 at 23:52

How to bin rows by mean then and compute dispersion to identify outliers

2 Answers2