
I have a data frame in R as follows:

PROBE_ID    H_1AVG_Signal   H_1Detection Pval   H_2AVG_Signal   H_2Detection Pval   GH_1AVG_Signal  GH_1Detection Pval
ILMN_1343291    47631.78    0.00            53022.43    0.00            46567.29    0.00
ILMN_1651229    135.42      0.01            161.59      0.01            162.46      0.04
ILMN_1651260    80.81       0.86            88.05       0.86            92.45       0.89
ILMN_1651279    143.65      0.01            138.96      0.04            113.29      0.47

Is there a way to subset the data to the probe IDs whose detection p-value is < 0.05 in all samples, selecting the columns by their common suffix "Detection Pval"? The subset I am hoping to get looks like this:

PROBE_ID    H_1AVG_Signal   H_1Detection Pval   H_2AVG_Signal   H_2Detection Pval   GH_1AVG_Signal  GH_1Detection Pval
ILMN_1343291    47631.78    0.00            53022.43    0.00            46567.29    0.00
ILMN_1651229    135.42      0.01            161.59      0.01            162.46      0.04

I would really appreciate advice on how to go about creating such a subset. Thank You.

Sayan28
  • Welcome to StackOverflow! Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610). This will make it much easier for others to help you. – Jaap May 15 '17 at 13:48
  • Probably you need something like: `df[rowSums(df[, grepl('Detection Pval', names(df), fixed = TRUE)] < 0.05) > 0, ]` – Jaap May 15 '17 at 14:22
  • @Jaap, Thank You. I edited my question and hope it is better now. I tried the code and got Error in matrix(unlist(value, recursive = FALSE, use.names = FALSE), nrow = nr, : 'data' must be of a vector type, was 'NULL' – Sayan28 May 15 '17 at 14:34
  • What does `class(name_of_your_dataframe)` return? – Jaap May 15 '17 at 14:39
  • @Jaap it returns "data.frame" – Sayan28 May 15 '17 at 14:44
  • When I read your data, this `df[rowSums(df[, grepl('Detection_Pval', names(df), fixed = TRUE)] < 0.05) == 3, ]` gives me the desired result. – Jaap May 15 '17 at 14:55
  • The error results from the space in your column names (as presented in your question). R has probably replaced it with a dot. If that is the case (you can check with `names(df)`), you should use `df[rowSums(df[, grepl('Detection.Pval', names(df), fixed = TRUE)] < 0.05) == 3, ]` (notice the difference in the pattern passed to `grepl`). – Jaap May 15 '17 at 14:58
  • @Jaap. It worked. The space was the problem. Thanks a lot for the help! – Sayan28 May 15 '17 at 15:11
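
For reference, here is a minimal base R sketch of the approach worked out in the comments above, using the p-values from the question. It assumes the column names went through R's default name checking (`check.names = TRUE`), which turns the space into a dot. The names `pval_cols`, `below` and `keep` are only illustrative.

```r
# Toy copy of the detection p-values from the question.
df <- data.frame(
  PROBE_ID = c("ILMN_1343291", "ILMN_1651229", "ILMN_1651260", "ILMN_1651279"),
  `H_1Detection Pval`  = c(0.00, 0.01, 0.86, 0.01),
  `H_2Detection Pval`  = c(0.00, 0.01, 0.86, 0.04),
  `GH_1Detection Pval` = c(0.00, 0.04, 0.89, 0.47),
  check.names = TRUE  # default: "H_1Detection Pval" becomes "H_1Detection.Pval"
)

# Accept a space, dot or underscore, whatever the import did to the name.
pval_cols <- grep("Detection[ ._]Pval$", names(df))

# An empty match is what produces the "'data' must be of a vector type" error,
# so stop early with a clearer failure.
stopifnot(length(pval_cols) > 0)

# drop = FALSE keeps a data frame even when only one column matches.
below <- df[, pval_cols, drop = FALSE] < 0.05

# Keep rows where every detection p-value is below the cutoff.
keep <- rowSums(below) == length(pval_cols)
result <- df[keep, ]
print(result)
```

Comparing against `length(pval_cols)` rather than a hard-coded `3` means the same lines keep working if more samples are added.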

2 Answers


If you know the column names in advance, you can use dplyr's `filter` to get the rows you want:

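library(dplyr)

# Note: `|` keeps rows where at least one p-value is below 0.05.
# Use `&` instead to keep only rows where all of them are.
main.df <- main.df %>%
           filter(`H_1Detection Pval` < 0.05 | `H_2Detection Pval` < 0.05 | `GH_1Detection Pval` < 0.05)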

If you don't know the column names in advance, you can collect them dynamically and pass them to dplyr's `filter_` function, as below:

library(dplyr)
# Find any columns that contain "detection" in the column name
det.cols <- colnames(main.df)[which(grepl("detection",tolower(colnames(main.df))))]

# Create a filter string from the column names in the format of
# "`column name` < 0.05 | `column name2` < 0.05"
# (replace " | " with " & " to require all columns to be below 0.05)
filt <- gsub(","," | ",toString(paste("`",det.cols,"`"," < 0.05", sep = "")))

# Apply the filter to the dataframe
main.df <- main.df %>%
           filter_(filt)
Matt Jewett
  • Thank You. I am getting "Error in parse(text = x) : attempt to use zero-length variable name" for the last code applying the filter to the dataframe – Sayan28 May 15 '17 at 14:55
  • Be sure to replace both instances of **main.df** in `det.cols <- colnames(` **main.df** `)[which(grepl("detection",tolower(colnames(` **main.df** `))))]` with the name of the dataframe you are using. – Matt Jewett May 15 '17 at 14:58
  • I did replace main.df with my data name. I got the error for the last code only. – Sayan28 May 15 '17 at 15:14
  • I'm guessing it is not getting any matches for columns that contain "detection" in the column name, which is creating an empty vector for det.cols. Because det.cols is empty it is causing the error to occur in the last line. Note that this code is comparing the column names as all lowercase letters, so be sure to keep the grepl pattern in the det.cols line as all lowercase letters, or it will not return any values. – Matt Jewett May 15 '17 at 15:24
  • I tried changing the header to lower case and it returned the filtered data without an error message. But now the filtered data contains the ILMN_1651279 gene, which has a detection pval > 0.05 for the third sample. – Sayan28 May 16 '17 at 03:24
  • Most likely that means one of the other pvals for that gene is < 0.05. If you only want to return rows where all pvals are < 0.05, then change the pipe operator | in the filt assignment to an ampersand &. That will change the filter from an "or" statement to an "and" statement. – Matt Jewett May 16 '17 at 04:13
  • Thank You. That change gave the expected result. – Sayan28 May 16 '17 at 05:16
  • In relation to above subsetting, I would like to filter out rows of the expression column with values below -1.5 and above 1.5. How do I do this? When I use filt<- gsub(","," | ",toString(paste("",det.cols,"","% filter_(filt) it gives a filtered data with only values above 1.5 not below -1.5. How do I rewrite this? – Sayan28 Feb 28 '19 at 00:34
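
The "or" versus "and" distinction discussed in these comments can also be sketched in base R, without building a filter string. This is an illustrative alternative rather than the answer's own method; `tests`, `all_below` and `any_below` are made-up names, and the data are the p-values from the question.

```r
# Toy data with the original column names (spaces kept).
main.df <- data.frame(
  PROBE_ID = c("ILMN_1343291", "ILMN_1651229", "ILMN_1651260", "ILMN_1651279"),
  `H_1Detection Pval`  = c(0.00, 0.01, 0.86, 0.01),
  `H_2Detection Pval`  = c(0.00, 0.01, 0.86, 0.04),
  `GH_1Detection Pval` = c(0.00, 0.04, 0.89, 0.47),
  check.names = FALSE
)

# Case-insensitive match, so the header does not have to be lower-cased.
det.cols <- grep("detection", names(main.df), ignore.case = TRUE, value = TRUE)
stopifnot(length(det.cols) > 0)  # an empty match is what broke the filter string

# One logical vector per p-value column.
tests <- lapply(main.df[det.cols], function(p) p < 0.05)

# "&" keeps rows where every column passes, "|" rows where at least one does.
all_below <- Reduce(`&`, tests)
any_below <- Reduce(`|`, tests)

print(main.df[all_below, ])  # the two probes asked for in the question
print(main.df[any_below, ])  # also includes ILMN_1651279
```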

`filter_at` is an easier way to select the columns dynamically, as discussed in "R dplyr filtering data with values greater than +N and lesser than -N : abs() function?"

main.df %>% filter_at(vars(contains("Detection Pval")), .vars_predicate = all_vars(. < 0.05))
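
In more recent dplyr releases the scoped verbs such as `filter_at` are superseded, and the same kind of selection can be written with `if_all()` or `if_any()` inside `filter()`. A minimal sketch, assuming dplyr 1.0.4 or later is installed and using the p-values from the question (`all_detected` and `any_detected` are illustrative names):

```r
library(dplyr)

# Toy data with the original column names (spaces kept).
main.df <- data.frame(
  PROBE_ID = c("ILMN_1343291", "ILMN_1651229", "ILMN_1651260", "ILMN_1651279"),
  `H_1Detection Pval`  = c(0.00, 0.01, 0.86, 0.01),
  `H_2Detection Pval`  = c(0.00, 0.01, 0.86, 0.04),
  `GH_1Detection Pval` = c(0.00, 0.04, 0.89, 0.47),
  check.names = FALSE
)

# Rows where every "Detection Pval" column is below 0.05.
all_detected <- main.df %>%
  filter(if_all(contains("Detection Pval"), ~ .x < 0.05))

# Rows where at least one "Detection Pval" column is below 0.05.
any_detected <- main.df %>%
  filter(if_any(contains("Detection Pval"), ~ .x < 0.05))

print(all_detected)
print(any_detected)
```

`contains()` matches a literal substring and ignores case by default, so it only finds the columns if the names still contain the space.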

Arthur Yip