I have a matrix where I want to split all rows into 20 bins according to row means. I can achieve this as follows:
library(dplyr)
n_bins = 20
data$bin = ntile(rowMeans(data), n_bins)
Now, within each bin, I would like to z-normalize the dispersion measure of all rows in that bin in order to identify outlier rows. I want to define outliers as rows with a z-score above a cutoff of 1.7. I'm not sure if there is an easy way to go about this, and I'm currently stuck at this point.
EDIT:
Problem re-stated/clarified: I have a rather large data.frame with 12374 rows (genes) and 785 columns (cells). I'd like to group rows into 20 bins according to rowMeans. Within each bin, I'd like to z-normalize the dispersion measure (variance/mean) of all genes in that bin, in order to identify outlier genes whose expression values are highly variable even when compared to genes with similar average expression. I would then like to extract the genes that exceed a z-score threshold of 1.7, to identify significantly variable genes from each bin.
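To make the intended computation concrete, here is a minimal sketch of the steps described above, written in base R so it runs without extra packages. It is only an illustration under stated assumptions: `expr` is simulated stand-in data (not my real table), the names `gene_mean`, `dispersion`, `zscore` and `variable_genes` are made up for the example, the rank-based binning line is a rough substitute for `ntile()` rather than an exact equivalent, and the cutoff is applied one-sided (only unusually high dispersion counts).

```r
# Simulated stand-in for the real 12374 x 785 table (hypothetical data).
set.seed(1)
expr <- matrix(rexp(200 * 30), nrow = 200,
               dimnames = list(paste0("gene", 1:200), paste0("cell", 1:30)))

n_bins   <- 20
z_cutoff <- 1.7

# Per-gene mean and dispersion (variance / mean).
gene_mean  <- rowMeans(expr)
dispersion <- apply(expr, 1, var) / gene_mean  # NaN for genes with mean 0

# Roughly equal-sized bins by rank of the mean (assumption: a stand-in
# for dplyr::ntile(gene_mean, n_bins), not guaranteed to match it exactly).
bin <- ceiling(rank(gene_mean, ties.method = "first") * n_bins / length(gene_mean))

# z-score of the dispersion within each bin; ave() returns a plain vector
# with one value per gene, in the original row order.
zscore <- ave(dispersion, bin, FUN = function(x) (x - mean(x)) / sd(x))

res <- data.frame(gene = rownames(expr), mean = gene_mean,
                  dispersion = dispersion, bin = bin, zscore = zscore)

# Genes whose within-bin z-score exceeds the cutoff (which() drops NA/NaN).
variable_genes <- res$gene[which(res$zscore > z_cutoff)]
```

Here `res` keeps one row per gene, so the per-bin outliers can be inspected with something like `split(res$gene[which(res$zscore > z_cutoff)], res$bin[which(res$zscore > z_cutoff)])`.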
> head(temp[,1:5])
              Drop7_0_AAACTAGGGTGG Drop7_0_AAAGGACGTACG Drop7_0_AACACTTGAGCC Drop7_0_AAGGCAACGAAT Drop7_0_AATGATGGGGTA
0610007P14RIK            0.1439444            0.0000000             0.000000            0.8759335            0.0000000
0610009B22RIK            0.0000000            0.6776718             0.000000            0.0000000            0.0000000
0610009O20RIK            0.1439444            0.0000000             0.000000            0.2735741            0.0000000
0610010B08RIK            1.4769893            1.1369215             1.124842            0.8759335            1.9544187
0610010F05RIK            0.7944809            0.0000000             0.000000            0.7016789            0.9144108
0610010K14RIK            0.1439444            0.0000000             1.124842            0.7016789            0.0000000
When I run this code:
library(dplyr)
n_bins = 20
temp = data
temp$rowm = rowMeans(temp)
outscore = temp %>% mutate(bin=ntile(rowm,n_bins)) %>%
group_by(bin) %>% mutate(zscore=scale(rowm),outlier=abs(zscore)>1.7)
I get the error: Error: dims [product 619] do not match the length of object [618]
which I think refers to the bin sizes rather than the number of bins: 12374 rows split into 20 bins gives bins of 619 or 618 rows.
Any suggestions?