I recently reverted to R version 3.1.3 for compatibility reasons and am now seeing unexplained behavior from the subset function.

I want to extract all rows for the gene "Migut.A00003" from the data frame transcr_effects, using the gene name as listed in the data frame expr_mim_genes (this will later become a loop). The call always returns all rows instead of the specific rows I am looking for, no matter how I format the subset condition:

> class(expr_mim_genes)
[1] "data.frame"

> sapply(expr_mim_genes, class)
       gene  longest.tr pair.length 
"character"   "logical"   "numeric" 

> head(expr_mim_genes)
          gene longest.tr pair.length
1 Migut.A00003         NA           0
2 Migut.A00006         NA           0
3 Migut.A00007         NA           0
4 Migut.A00012         NA           0
5 Migut.A00014         NA           0
6 Migut.A00015         NA           0

> class(transcr_effects)
[1] "data.frame"

> sapply(transcr_effects, class)
       pair        gene 
"character" "character" 

> head(transcr_effects)
       pair         gene
1     pair1 Migut.N01020
2    pair10 Migut.A00351
3  pair1000 Migut.F00857
4 pair10007 Migut.D01637
5 pair10008 Migut.A00401
6 pair10009 Migut.G00442
. . .
7168 pair3430 Migut.A00003
. . .

The gene I am interested in:

> expr_mim_genes[1,"gene"]
[1] "Migut.A00003"

R sees these two terms as equivalent:

> expr_mim_genes[1,"gene"] == "Migut.A00003"
[1] TRUE

If I type in the name of the gene manually, the correct number of rows is returned:

> nrow(subset(transcr_effects, transcr_effects$gene=="Migut.A00003"))
[1] 1
> subset(transcr_effects, transcr_effects$gene=="Migut.A00003")
         pair         gene
7168 pair3430 Migut.A00003

However, the following should return one row from the data frame, but it returns all rows:

> nrow(subset(transcr_effects, transcr_effects$gene == (expr_mim_genes[1,"gene"])))
[1] 10122

I have a feeling this has something to do with text formatting, but I've tried everything and haven't been able to figure it out. I've seen this issue with quoted vs. unquoted entries, but that does not appear to be the issue here (see the equality check above).

I didn't have this problem before switching to R v.3.1.3, so maybe it is a version convention I am unaware of?

EDIT: This is driving me crazy, but I think I have at least found a workaround. Getting to this point in the code took quite a bit of data and file processing, including loading at least 4 files. I tried taking snippets of each file to post a reproducible example here, but when I analyze the snippets the problem sometimes recurs and sometimes does not (!!). After going through that process, though, I discovered that:

i = 1
gene = expr_mim_genes[i,"gene"]

> nrow(subset(transcr_effects, gene == gene))
[1] 10122
> nrow(subset(transcr_effects, gene == (expr_mim_genes[i,"gene"])))
[1] 1

I still can't explain this behavior of the code, but at least I know how to work around it. Thanks all.
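
A likely explanation for the `gene == gene` result above, sketched here with made-up data rather than the original files: `subset()` evaluates its condition inside the data frame first and only then falls back to the calling environment. Because transcr_effects has a column called gene, both sides of `gene == gene` would resolve to that column, so every row matches. The three-row data frames and the name `target_gene` below are assumptions for illustration only, and this would not by itself account for the earlier `transcr_effects$gene == ...` result.

```r
# Made-up stand-ins for the real data frames (assumption: same column
# names and character columns as in the output shown above).
transcr_effects <- data.frame(
  pair = c("pair1", "pair10", "pair3430"),
  gene = c("Migut.N01020", "Migut.A00351", "Migut.A00003"),
  stringsAsFactors = FALSE
)
expr_mim_genes <- data.frame(
  gene = c("Migut.A00003", "Migut.A00006"),
  stringsAsFactors = FALSE
)

i <- 1
gene <- expr_mim_genes[i, "gene"]

# Inside subset(), `gene` resolves to the column transcr_effects$gene on
# BOTH sides, so the column is compared with itself and every row is kept.
n_masked <- nrow(subset(transcr_effects, gene == gene))

# A name that is not a column of transcr_effects is looked up in the
# calling environment instead (target_gene is a hypothetical name).
target_gene <- expr_mim_genes[i, "gene"]
n_renamed <- nrow(subset(transcr_effects, gene == target_gene))

# Plain bracket indexing avoids non-standard evaluation altogether.
n_bracket <- nrow(transcr_effects[transcr_effects$gene == target_gene, ])
```

With this toy data, `n_masked` counts all three rows, while `n_renamed` and `n_bracket` count only the single matching row. Inside a loop, bracket indexing or a variable name that does not clash with a column is probably the safer choice.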

BillieG
  • Probably unrelated to your problem, but why are you using `transcr_effects$` in the `subset` condition? The whole point of the function is to omit that. Also, what does `transcr_effects[transcr_effects$gene == expr_mim_genes[1,"gene"], ]` return? – Konrad Rudolph Apr 24 '17 at 15:55
  • It's easier to help you if you provide a proper [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) that we can copy/paste into R to test. Using `dput()` is helpful. – MrFlick Apr 24 '17 at 16:08
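
As a rough sketch of what such a reproducible example could look like, assuming a small made-up slice stands in for the real data: `dput()` prints an R expression that rebuilds the object, so the printed text can be pasted straight into a question. The object name `snippet` and its two rows are hypothetical; with the real data, something like `dput(head(transcr_effects))` would play the same role.

```r
# Hypothetical two-row slice standing in for transcr_effects.
snippet <- data.frame(
  pair = c("pair1", "pair3430"),
  gene = c("Migut.N01020", "Migut.A00003"),
  stringsAsFactors = FALSE
)

# Print an R expression that recreates `snippet` when pasted into a session.
dput(snippet)

# Round trip: capture the printed expression as text and evaluate it again
# to check that it rebuilds an equivalent data frame.
snippet_text <- paste(capture.output(dput(snippet)), collapse = "\n")
rebuilt <- eval(parse(text = snippet_text))
```

Checking that `rebuilt` matches `snippet` (for example with `all.equal()`) before posting helps confirm that readers will see the same data.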
