3

I am working with RNA-seq data. I want to filter out genes where there are fewer than N counts in both replicates in at least one of my treatment groups.

My data is in a DESeq object, and the count data is structured like this, where each row is the gene, and each column a different sample. Sample names have the structure X|N|A|1/2 (where X is the cell line used, N is a 1 or 2 digit number reflecting treatment length, A is a letter representing the treatment group, and 1 or 2 indicates which replicate.

X1A1 <- c(117, 24, 45, 146, 1)
X1A2 <- c(129, 31, 58, 159, 0)
X1B1 <- c(136, 25, 50, 1293, 0)
X1B2 <- c(131, 24, 50, 1073, 4)
X1C1 <- c(113, 23, 43, 132, 0)
X1C2 <- c(117, 18, 43, 126, 0)
X1D1 <- c(101, 20, 0, 875, 1)
X1D2 <- c(99, 21, 38 , 844, 0)
X24A1 <- c(109, 17, 60, 95, 0)
X24A2 <- c(122, 14, 611, 90, 0)

df <- data.frame(X1A1, X1A2, X1B1, X1B2, X1C1, X1C2, X1D1, X1D2, X24A1, X24A2)
rownames(df) <- c("geneA", "geneB", "geneC", "geneD", "geneE")
df

Maybe I'm just not using the right search terms, but I can't figure out how to get what I need.

Right now I only know how to filter out genes that are not expressed below some threshold in all samples. For example, filtering out genes which are not expressed at all.

keep1 <- rowSums(df) > 1
df1 <- df[keep1,]

What I want is to refine this so that I would end up discarding geneE in my example set, because no group has counts above 0 for both replicates.

df2 <- df[1:4,]
df2

2 Answers2

4

The logic should be applied on the 'df' itself to create a logical matrix, then when we do rowSums, it counts the number of TRUE (or 1) values, then use that to do the second condition i.e. at least more than one TRUE (> 1)

df[rowSums(df > 1) > 1,]

-output

      X1A1 X1A2 X1B1 X1B2 X1C1 X1C2 X1D1 X1D2 X24A1 X24A2
geneA  117  129  136  131  113  117  101   99   109   122
geneB   24   31   25   24   23   18   20   21    17    14
geneC   45   58   50   50   43   43    0   38    60   611
geneD  146  159 1293 1073  132  126  875  844    95    90

If it is by groups of columns, then either the | (OR) or & (AND) could work. Not clear from the comments

df[Reduce(`&`, lapply(split.default(df, sub("\\d+$", "", 
     names(df))), function(x) rowSums(x > 1) > 1)),]
      X1A1 X1A2 X1B1 X1B2 X1C1 X1C2 X1D1 X1D2 X24A1 X24A2
geneA  117  129  136  131  113  117  101   99   109   122
geneB   24   31   25   24   23   18   20   21    17    14
geneD  146  159 1293 1073  132  126  875  844    95    90

df[Reduce(`|`, lapply(split.default(df, sub("\\d+$", "", 
      names(df))), function(x) rowSums(x > 1) > 1)),]
      X1A1 X1A2 X1B1 X1B2 X1C1 X1C2 X1D1 X1D2 X24A1 X24A2
geneA  117  129  136  131  113  117  101   99   109   122
geneB   24   31   25   24   23   18   20   21    17    14
geneC   45   58   50   50   43   43    0   38    60   611
geneD  146  159 1293 1073  132  126  875  844    95    90
akrun
  • 874,273
  • 37
  • 540
  • 662
  • @akron Thanks, but this doesn't appear to take into account the groups. If I change it to X1A1 <- c(117, 24, 45, 146, 2), then geneE is included in your solution. I want it to return genes where counts > N in both X1A1 & X1A2 (and/or X1B1 & X1B2, and/or X1C1 & X1C2, etc.). At least one (but can be more) of the groups (1A, 1B, 1C, 1D, or 24A) must have counts > N in both replicates for the gene to be kept. – FutureFellwalker Oct 28 '22 at 20:55
  • @FutureFellwalker try the update. Not clear from your comments what the expected would be – akrun Oct 29 '22 at 16:56
0

Here is another option. If your data is always structured the way you show, then this function will evaluate that all replicates have at least 1 value greater than N.

library(tidyverse)

#example Data
df
#>       X1A1 X1A2 X1B1 X1B2 X1C1 X1C2 X1D1 X1D2 X24A1 X24A2
#> geneA  117  129  136  131  113  117  101   99   109   122
#> geneB   24   31   25   24   23   18   20   21    17    14
#> geneC   45   58   50   50   43   43    0   38    60   611
#> geneD  146  159 1293 1073  132  126  875  844    95    90
#> geneE    1    0    0    4    0    0    1    0     0     0


filter_fewer <- function(data, N){
  gene_list <- data |>
    rownames_to_column("gene") |>
    pivot_longer(cols = -gene, 
                 names_to = c("var", "type", "rep"),
                 names_pattern = "(\\w\\d+)(\\w)(\\d+)") |>
    pivot_wider(names_from = rep, values_from = value, names_glue = "rep_{rep}") |>
    group_by(gene, var, type) |>
    mutate(flag = if_any(c(rep_1, rep_2), ~.>N)) |>
    ungroup() |>
    group_by(gene) |>
    filter(sum(flag) == n()) |>
    pull(gene)|>
    unique()

  filter(data, row.names(data) %in% gene_list)
}

#your example
filter_fewer(df, 1)
#>       X1A1 X1A2 X1B1 X1B2 X1C1 X1C2 X1D1 X1D2 X24A1 X24A2
#> geneA  117  129  136  131  113  117  101   99   109   122
#> geneB   24   31   25   24   23   18   20   21    17    14
#> geneC   45   58   50   50   43   43    0   38    60   611
#> geneD  146  159 1293 1073  132  126  875  844    95    90

#remove geneB because neither A24A rep 1 or 2 are > 20
filter_fewer(df, 20)
#>       X1A1 X1A2 X1B1 X1B2 X1C1 X1C2 X1D1 X1D2 X24A1 X24A2
#> geneA  117  129  136  131  113  117  101   99   109   122
#> geneC   45   58   50   50   43   43    0   38    60   611
#> geneD  146  159 1293 1073  132  126  875  844    95    90
AndS.
  • 7,748
  • 2
  • 12
  • 17
  • This works with my example data set, but not with my real dataset. I'm not familiar with all the arguments in your function, so I'm not entirely sure what to change. But the error I'm getting is: "Error in UseMethod("filter") : no applicable method for 'filter' applied to an object of class "c('matrix', 'array', 'integer', 'numeric')" – FutureFellwalker Oct 28 '22 at 20:57
  • it sounds like your df is actually a matrix and not a dataframe. – AndS. Oct 28 '22 at 22:51