2

I am using the dplyr package in R for filtering my data of gene expressions. I have calculated fold changes and would like to filter the genes (rows) in which at least one sample (columns) has a value greater than +0.584963 OR less than -0.584963.An example data:

       X SAMPLE_1_FC SAMPLE_2_FC SAMPLE_3_FC SAMPLE_4_FC SAMPLE_5_FC
GENE_1      0.6780      0.4050      0.8870      0.3300      0.2230
GENE_2      0.2340     -0.6670      0.0020      0.1240      0.3560
GENE_3      0.0170      0.1560      0.1120      0.0080     -0.1230
GENE_4     -0.0944     -0.1372     -0.1800     -0.2228     -0.2656
GENE_5     -0.8080     -0.7800     -0.5560      0.0340      0.4450
GENE_6      0.2091      0.1106      0.0121     -0.0864     -0.1849
GENE_7      0.5980      0.7680      0.9970      0.4670     -0.7760

I am currently using the following script

det.cols<- colnames(my.data)[which(grepl("fc",tolower(colnames(my.data))))]
filt <- gsub(","," | ",toString(paste("`",det.cols,"`",">abs(0.584963)", sep = "")))
my.datasub<- my.data %>% filter_(filt)

but this returns only the genes greater than +0.584963 and not the negative ones. In the case of the example, what I want is a subsetted list with Genes 1, 2, 5 and 7. But instead it gives me only Genes 1 and 7. How can I change this?

I am expecting the answer to be in this format:

 X SAMPLE_1_FC SAMPLE_2_FC SAMPLE_3_FC SAMPLE_4_FC SAMPLE_5_FC
GENE_1      0.6780      0.4050      0.8870      0.3300      0.2230
GENE_2      0.2340     -0.6670      0.0020      0.1240      0.3560
GENE_5     -0.8080     -0.7800     -0.5560      0.0340      0.4450
GENE_7      0.5980      0.7680      0.9970      0.4670     -0.7760

Thanks.

Sayan28
  • 53
  • 1
  • 6
  • 2
    please provide some sample data for a fully reproducible example. thanks! – Arthur Yip Feb 28 '19 at 07:08
  • 1
    Thank you Arthur Yip. I have edited the question. Hope it helps. – Sayan28 Feb 28 '19 at 23:24
  • Oh I finally spotted the minor error in your code. I fixed it in the answer below. – Arthur Yip Mar 01 '19 at 09:19
  • Looks like the custom filter code came from https://stackoverflow.com/questions/43981187/is-there-a-way-to-select-rows-from-different-columns-with-similar-suffixes-in-a - would be helpful next time to link to source material – Arthur Yip Mar 04 '19 at 05:34

3 Answers3

2

Here's a solution that is flexible to the number of samples and data rows. It involves transforming the data into long format and then filters for the gene and specific sample. I tested it on 50k genes and 35 samples, and it ran in < 1 second.

library(tidyverse)

# set up sample data with 50000 rows
mydata <- data.frame(stringsAsFactors=FALSE,
                     X = c("GENE_1", "GENE_2", "GENE_3", "GENE_4", "GENE_5", "GENE_6", "GENE_7", 1:50000),
                     SAMPLE_1_FC = c(0.678, 0.234, 0.017, -0.0944, -0.808, 0.2091, 0.598, rnorm(50000, 0, 1)),
                     SAMPLE_2_FC = c(0.405, -0.667, 0.156, -0.1372, -0.78, 0.1106, 0.768, rnorm(50000, 0, 1)),
                     SAMPLE_3_FC = c(0.887, 0.002, 0.112, -0.18, -0.556, 0.0121, 0.997, rnorm(50000, 0, 1)),
                     SAMPLE_4_FC = c(0.33, 0.124, 0.008, -0.2228, 0.034, -0.0864, 0.467, rnorm(50000, 0, 1)),
                     SAMPLE_5_FC = c(0.223, 0.356, -0.123, -0.2656, 0.445, -0.1849, -0.776, rnorm(50000, 0, 1)))

# duplicate 30 more columns
mydata2 <- bind_cols(mydata, mydata[2:6], mydata[2:6], mydata[2:6], mydata[2:6], mydata[2:6], mydata[2:6])

(mydata3 <- mydata2 %>% gather(key = "sample_num", value = "fc", 2:length(mydata)) %>%
  filter(fc > 0.584963 | fc < -0.584963) %>%
  select(X) %>%
  arrange(desc(X)) %>%
  unique() %>%
  head())
#>         X
#> 1  GENE_7
#> 5  GENE_5
#> 7  GENE_2
#> 8  GENE_1
#> 10   9999
#> 14   9998

Created on 2019-03-01 by the reprex package (v0.2.1)

Arthur Yip
  • 5,810
  • 2
  • 31
  • 50
  • This gives the result but the format of the result would still be hard to work with. I am expecting a result with a table having the same columns but just the genes narrowed down. I did this for narrowing down genes with pval<0.05 but this was just one requirement so it was fine, unlike here for fc where I have two requirements <-0.584963 and >0.584963. – Sayan28 Mar 01 '19 at 06:06
  • right now the code produces a column with all the gene names. What do you need/expect? – Arthur Yip Mar 01 '19 at 06:08
  • I have just re-edited the question with the expected format as I was unsure on how to put it in the comment – Sayan28 Mar 01 '19 at 06:15
2

Long story short, you had the abs() in the wrong place in your code.

I fixed it here:

det.cols<- colnames(my.data)[which(grepl("fc",tolower(colnames(my.data))))]
filt <- gsub(","," | ",toString(paste("abs(`",det.cols,"`)",">0.584963", sep = "")))
my.datasub<- my.data %>% filter_(filt)

For further flexibility, @ha_pu provided a great filter_at solution building off my previous solution (before I identified the error in your code).

Arthur Yip
  • 5,810
  • 2
  • 31
  • 50
1

Using filter_at from dplyr might be an even more flexible approach...

# set up sample data with 50000 rows [as proposed by Arthur Yip above]
mydata <- tibble(X = c("GENE_1", "GENE_2", "GENE_3", "GENE_4", "GENE_5", "GENE_6", "GENE_7", 1:50000),
                     SAMPLE_1_FC = c(0.678, 0.234, 0.017, -0.0944, -0.808, 0.2091, 0.598, rnorm(50000, 0, 1)),
                     SAMPLE_2_FC = c(0.405, -0.667, 0.156, -0.1372, -0.78, 0.1106, 0.768, rnorm(50000, 0, 1)),
                     SAMPLE_3_FC = c(0.887, 0.002, 0.112, -0.18, -0.556, 0.0121, 0.997, rnorm(50000, 0, 1)),
                     SAMPLE_4_FC = c(0.33, 0.124, 0.008, -0.2228, 0.034, -0.0864, 0.467, rnorm(50000, 0, 1)),
                     SAMPLE_5_FC = c(0.223, 0.356, -0.123, -0.2656, 0.445, -0.1849, -0.776, rnorm(50000, 0, 1)))

# duplicate 30 more columns [as proposed by Arthur Yip above]
mydata2 <- bind_cols(mydata, mydata[2:6], mydata[2:6], mydata[2:6], mydata[2:6], mydata[2:6], mydata[2:6])

mydata2 %>%
  filter_at(vars(contains("fc")), .vars_predicate =  any_vars(abs(.) > 0.584963))

In the vars() you can define your list of variables to which you want to apply the filtering. Following .vars_predicate you can define the filter criterion (any_vars is equal to |, all_vars is equal to &).

Arthur Yip
  • 5,810
  • 2
  • 31
  • 50
ha-pu
  • 581
  • 7
  • 19
  • 1
    Thank you ha_pu for the answer and sorry I was not clear on the question. I have edited my question now and added an example data as well. I have multiple columns and want to extract a row in which at least one of the columns has >0.584963 or <-0,584963. – Sayan28 Feb 28 '19 at 23:24
  • 1
    Thanks for the clarifications, I have adapted my response accordingly. – ha-pu Mar 01 '19 at 08:46
  • 1
    Sayan wants the columns with "fc" so we can say filter_at(vars(contains("fc")), ... which is basically a much simpler version of the grepl and custom filt that was in the opening post. – Arthur Yip Mar 01 '19 at 09:15
  • 1
    Depeding on your needs, you can also use matches() in the vars() expression, allowing to use regex, which makes your variable selection even more flexible... – ha-pu Mar 01 '19 at 09:28