
I would like to extract items from a column in a data frame based on criteria pertaining to values in other columns. These criteria are given in the form of a list associating column names with values. The ultimate goal is to use those items to select columns by name in another data structure.

Here is an example data frame:

> experimental_plan
  lib genotype treatment replicate
1   A       WT    normal         1
2   B       WT       hot         1
3   C      mut    normal         1
4   D      mut       hot         1
5   E       WT    normal         2
6   F       WT       hot         2
7   G      mut    normal         2
8   H      mut       hot         2

And my selection criteria are encoded as the following list:

> ref_condition = list(genotype="WT", treatment="normal")

I want to extract the items in the "lib" column where the line matches ref_condition, that is "A" and "E".

1) I can get the columns to use for selection using names on my list of selection criteria:

> experimental_plan[, names(ref_condition)]
  genotype treatment
1       WT    normal
2       WT       hot
3      mut    normal
4      mut       hot
5       WT    normal
6       WT       hot
7      mut    normal
8      mut       hot

2) I can test whether the resulting lines match my selection criteria:

> experimental_plan[, names(ref_condition)] == ref_condition
     genotype treatment
[1,]     TRUE      TRUE
[2,]     TRUE     FALSE
[3,]    FALSE      TRUE
[4,]    FALSE     FALSE
[5,]     TRUE      TRUE
[6,]     TRUE     FALSE
[7,]    FALSE      TRUE
[8,]    FALSE     FALSE
> selection_vector <- apply(experimental_plan[, names(ref_condition)] == ref_condition, 1, all)
> selection_vector
[1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE

(I think this step, with the apply, is not particularly elegant. There must be a better way.)
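One possible way to avoid the apply, sketched in base R only: compare each column named in ref_condition with its target value, then combine the resulting logical vectors with `&`. The data frame is rebuilt here so the snippet runs on its own; `column_matches` and `selected_libs` are names introduced just for this illustration, and the columns are assumed to be character rather than factor.

```r
# Rebuild the example data (character columns assumed).
experimental_plan <- data.frame(
  lib = LETTERS[1:8],
  genotype = rep(c("WT", "WT", "mut", "mut"), 2),
  treatment = rep(c("normal", "hot"), 4),
  replicate = rep(1:2, each = 4),
  stringsAsFactors = FALSE
)
ref_condition <- list(genotype = "WT", treatment = "normal")

# One logical vector per criterion: does the column equal the wanted value?
column_matches <- Map(
  function(column, value) experimental_plan[[column]] == value,
  names(ref_condition), ref_condition
)

# A row is selected only if every criterion holds.
selection_vector <- Reduce(`&`, column_matches)
selected_libs <- experimental_plan$lib[selection_vector]
```

This works column by column, so it never builds the intermediate logical matrix, and it should keep working if ref_condition gains more criteria.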

3) This boolean vector can be used to select the relevant lines:

> selected_lines <- experimental_plan[selection_vector,]
> selected_lines
  lib genotype treatment replicate
1   A       WT    normal         1
5   E       WT    normal         2

4) From this point on, I know how to use dplyr to select items I'm interested in:

> lib1 <- filter(selected_lines, replicate=="1") %>% select(lib) %>% unlist()
> lib2 <- filter(selected_lines, replicate=="2") %>% select(lib) %>% unlist()
> lib1
lib 
  A 
Levels: A B C D E F G H
> lib2
lib 
  E 
Levels: A B C D E F G H

Can dplyr (or other clever techniques) be used in earlier steps?
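For what it's worth, a filtering join might cover steps 1 to 3 in one call. The sketch below turns ref_condition into a one-row data frame and uses dplyr's `semi_join` to keep only the rows of experimental_plan that match it. It assumes character columns (not factors), and `selected_libs` is a name made up for the illustration.

```r
library(dplyr)

# Rebuild the example data (character columns assumed).
experimental_plan <- data.frame(
  lib = LETTERS[1:8],
  genotype = rep(c("WT", "WT", "mut", "mut"), 2),
  treatment = rep(c("normal", "hot"), 4),
  replicate = rep(1:2, each = 4),
  stringsAsFactors = FALSE
)
ref_condition <- list(genotype = "WT", treatment = "normal")

# Keep the rows of experimental_plan that have a match in the
# one-row data frame built from the selection criteria.
selected_lines <- semi_join(
  experimental_plan,
  as.data.frame(ref_condition, stringsAsFactors = FALSE),
  by = names(ref_condition)
)
selected_libs <- pull(selected_lines, lib)
```

Since the join columns come from `names(ref_condition)`, the same call should work for any set of criteria without spelling out the column names.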

5) These items happen to correspond to column names in another data structure (named counts_data here). I use them to extract the corresponding columns and put them in a list, associated with replicate numbers as names:

> counts_1 <- counts_data[, lib1]
> counts_2 <- counts_data[, lib2]
> list_of_counts <- list("1" = counts_1, "2" = counts_2)

(Ideally, I would like to generalize the code so that I do not need to hard-code the values that exist in the "replicate" column: there could be any number of replicates for a given combination of "genotype" and "treatment", and I want my final list to contain the data from counts_data for the corresponding "lib" items.)
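A rough sketch of how the replicate values could be left out of the code entirely: `split` groups the selected "lib" names by replicate, and `lapply` then pulls the matching columns for each group. The `counts_data` matrix below is invented purely so the snippet runs (the real object is not shown in the question), and `libs_by_replicate` is an illustrative name.

```r
# The rows selected in step 3, rebuilt by hand.
selected_lines <- data.frame(
  lib = c("A", "E"),
  genotype = "WT",
  treatment = "normal",
  replicate = c(1, 2),
  stringsAsFactors = FALSE
)

# Hypothetical counts: 3 genes by 8 libraries, columns named after "lib".
counts_data <- matrix(
  1:24, nrow = 3,
  dimnames = list(paste0("gene", 1:3), LETTERS[1:8])
)

# Group library names by replicate; the list names are the replicate values.
libs_by_replicate <- split(as.character(selected_lines$lib), selected_lines$replicate)

# For each replicate, extract the matching columns (drop = FALSE keeps a matrix).
list_of_counts <- lapply(
  libs_by_replicate,
  function(libs) counts_data[, libs, drop = FALSE]
)
```

The resulting list is named "1", "2", and so on, after whatever replicate values are present, and a replicate with several libraries would simply yield a wider sub-matrix.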

Is there a way to do the whole process more elegantly / efficiently?

bli

1 Answer


I think you can use data.table for this, with a key:

library(data.table)
test <- data.table(lib = LETTERS[1:8],
           genotype = rep(c("WT","WT","mut","mut"),2),
           treatment = rep(c("normal","hot"),4),
           replicate = c(rep(1,4),rep(2,4)))
setkeyv(test,c("genotype","treatment"))
ref_condition = list(genotype="WT", treatment="normal")
test[ref_condition,lib]

This gives

[1] "A" "E"

You could of course use lapply to loop over a list of test conditions.
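As a variant that skips setting a key, data.table's `on` argument can name the join columns directly. This is only a sketch on the same toy table; it assumes ref_condition is a named list whose names match column names, and `selected_libs` is a name introduced for the illustration.

```r
library(data.table)

# Same toy table as above, without calling setkeyv().
test <- data.table(
  lib = LETTERS[1:8],
  genotype = rep(c("WT", "WT", "mut", "mut"), 2),
  treatment = rep(c("normal", "hot"), 4),
  replicate = c(rep(1, 4), rep(2, 4))
)
ref_condition <- list(genotype = "WT", treatment = "normal")

# Join on the columns named in ref_condition and return the "lib" column.
selected_libs <- test[ref_condition, on = names(ref_condition), lib]
```

Because no key is set, the table keeps its original row order.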

Frank
Martin
  • Fyi, latest recommendation is to use `on` in most cases instead of bothering with keys: http://stackoverflow.com/a/20057411/ – Frank Nov 22 '16 at 16:06
  • I tried this. It worked with your example, but in my actual code I got "Error in `[.data.frame`(x, i, j) : object 'lib' not found". A test with your code shows that using "lib" instead of lib doesn't give the expected results, but I tried anyway, and now I get "Error in xj[i] : invalid subscript type 'list'" I used `names(ref_condition)` when setting the keys. A test with your code tells me that this is not the cause of the problem. What can produce such errors? – bli Nov 23 '16 at 18:20
  • I think you still use a data.frame instead of a data.table. Say your data is in data frame "df", then use dt <- as.data.table(df) – Martin Nov 25 '16 at 09:54
  • @Martin Whether by using `data.table()` or `as.data.table()` to convert my object, the result is the same: `print(class(dt))` returns `[1] "data.table" "data.frame"` and the error `object 'lib' not found` occurs. There is actually one more line to the error message: `Calls: -> get_counts -> [ -> [.data.table -> [.data.frame` – bli Nov 29 '16 at 12:09
  • @bli can you provide a minimal working code sample that produces the error? – Martin Nov 30 '16 at 09:38