Error while mapping SYMBOLS to ENTREZID

Question

I am getting a strange error converting Gene Symbols to Entrez ID. Here is my code:

testData = read.delim("IL_CellVar.txt",head=T,row.names = 2)
testData[1:5,1:3]

# ClustID Genes.Symbol  ChrLoc
# NM_001034168.1       4         Ank2 chrNA:-1--1
# NM_013795.4          4        Atp5l chrNA:-1--1
# NM_018770            4       Igsf4a chrNA:-1--1
# NM_146150.2          4         Nrd1 chrNA:-1--1
# NM_134065.3          4        Epdr1 chrNA:-1--1

clustNum = 5
filteredClust = testData[testData$ClustID == clustNum,]

any(is.na(filteredClust$Genes.Symbol))
# [1] FALSE

selectedEntrezIds <- unlist(mget(filteredClust$Genes.Symbol,org.Mm.egSYMBOL2EG))

# Error in unlist(mget(filteredClust$Genes.Symbol, org.Mm.egSYMBOL2EG)) :
#  error in evaluating the argument 'x' in selecting a method for function 
#     'unlist': Error in #.checkKeysAreWellFormed(keys) :
#  keys must be supplied in a character vector with no NAs

Another approach fails too:

selectedEntrezIds = select(org.Mm.eg.db,filteredClust$Genes.Symbol, "ENTREZID")

# Error in .select(x, keys, columns, keytype = extraArgs[["kt"]], jointype = jointype) :
#   'keys' must be a character vector

Just to rule it out, explicitly removing NA values doesn't help either:

a <- filteredClust$Genes.Symbol[!is.na(filteredClust$Genes.Symbol)]
selectedEntrezIds <- unlist(mget(a,org.Mm.egSYMBOL2EG))

# Error in unlist(mget(a, org.Mm.egSYMBOL2EG)) : 
#   error in evaluating the argument 'x' in selecting a method for function 
#      'unlist': Error in # .checkKeysAreWellFormed(keys) : 
#  keys must be supplied in a character vector with no NAs

I am not sure why I am getting this error, as the master file from which the gene symbols for testData were extracted gives no problem when converting to Entrez IDs. I would appreciate help on this.

Where does `org.Mm.eg.db` or `org.Mm.egSYMBOL2EG` come from? What R packages are you using? What is `class(filteredClust$Genes.Symbol[1])` and `class(get(filteredClust$Genes.Symbol[1], org.Mm.egSYMBOL2EG))` and what's the result of `table(lapply(mget(filteredClust$Genes.Symbol, org.Mm.egSYMBOL2EG), class))` — MrFlick, Aug 15 '14 at 05:50
@MrFlick, these are packages from `bioConductor` used for these mappings. — Xin Yin, Aug 15 '14 at 06:01
Well, the code should be a complete [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) that we can run to get the same error as the OP. Without the necessary library/package information in the code, that is not possible. — MrFlick, Aug 15 '14 at 06:05
This was cross-posted and answered on the [Bioconductor mailing list](https://stat.ethz.ch/pipermail/bioconductor/2014-August/061047.html), so either @XinYin or I have wasted our time. Please don't cross-post, choose a solution forum and stick with it. — Martin Morgan, Aug 15 '14 at 10:49

Xin Yin · Accepted Answer · 2014-08-15T07:15:58.727

Since you didn't provide a minimal reproducible example that we can use to replicate the error, I'm speculating here based on the error message. This is most likely caused by the default behavior of read.delim and similar functions (read.csv, read.table, etc.), which convert strings in your data file to factors.

You need to add an extra parameter to read.delim, specifically stringsAsFactors=FALSE (the default is TRUE in R versions before 4.0.0).

That is,

testData = read.delim("IL_CellVar.txt", header=TRUE, row.names=2, stringsAsFactors=FALSE)

The documentation for read.table describes this argument:

stringsAsFactors
logical: should character vectors be converted to factors? Note that this is overridden by as.is and colClasses, both of which allow finer control.
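
As a minimal sketch of what that documentation describes, the snippet below reads a tiny made-up table three ways. The inline text only stands in for IL_CellVar.txt and its values are invented for illustration. stringsAsFactors = TRUE is passed explicitly in the first call so that the factor conversion shows up regardless of the default in the installed R version.

```r
# Made-up stand-in for the tab-delimited input file (values are illustrative).
txt <- "ClustID\tGenes.Symbol\n5\tAnk2\n5\tAtp5l\n4\tNrd1\n"

# Explicitly request factor conversion so it is visible on any R version.
as_factor <- read.delim(text = txt, header = TRUE, stringsAsFactors = TRUE)

# Three ways to keep the symbols as plain character strings.
via_flag    <- read.delim(text = txt, header = TRUE, stringsAsFactors = FALSE)
via_as_is   <- read.delim(text = txt, header = TRUE, as.is = TRUE)
via_classes <- read.delim(text = txt, header = TRUE,
                          colClasses = c("integer", "character"))

class(as_factor$Genes.Symbol)    # "factor"
class(via_flag$Genes.Symbol)     # "character"
class(via_as_is$Genes.Symbol)    # "character"
class(via_classes$Genes.Symbol)  # "character"
```

colClasses is the most explicit of the three, since it also pins the type of every other column.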

You can check the class of your Genes.Symbol column with:

class(testData$Genes.Symbol)

and I guess it would be "factor".

This leads to the error you had:

# Error in .select(x, keys, columns, keytype = extraArgs[["kt"]], jointype = jointype) :
#   'keys' must be a character vector
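
The failing check can be illustrated without the annotation package. In this sketch, which uses invented symbols, a factor does not pass is.character(), and dropping NAs from a factor still returns a factor. That would explain why filtering with !is.na() in the question made no difference.

```r
# Invented symbols, stored as a factor the way read.delim may have produced them.
symbols <- factor(c("Ank2", "Atp5l", NA, "Nrd1"))

is.character(symbols)              # FALSE: a factor is integer codes plus levels

# Removing NAs subsets the factor but does not change its class.
no_na <- symbols[!is.na(symbols)]
class(no_na)                       # "factor"

# Only an explicit conversion yields a character vector.
keys <- as.character(no_na)
is.character(keys)                 # TRUE
```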

You can also manually convert the factor to a character vector with:

testData$Genes.Symbol <- as.character(testData$Genes.Symbol)
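
Putting the conversion together with the lookup might look like the sketch below. prepare_keys is a hypothetical helper, not part of any package, and the symbols are invented. The lookup itself assumes the Bioconductor packages AnnotationDbi and org.Mm.eg.db are installed and is skipped otherwise. Passing keytype = "SYMBOL" is an assumption about what the question intended; it states explicitly that the keys are gene symbols.

```r
# Hypothetical helper: turn a factor or character column into clean lookup keys.
prepare_keys <- function(x) {
  x <- as.character(x)     # factors become their labels, not their integer codes
  unique(x[!is.na(x)])     # drop missing values and duplicates
}

keys <- prepare_keys(factor(c("Ank2", "Atp5l", NA, "Ank2")))
keys

# The lookup runs only if the annotation packages are available.
if (requireNamespace("AnnotationDbi", quietly = TRUE) &&
    requireNamespace("org.Mm.eg.db", quietly = TRUE)) {
  entrez <- AnnotationDbi::select(org.Mm.eg.db::org.Mm.eg.db,
                                  keys = keys,
                                  columns = "ENTREZID",
                                  keytype = "SYMBOL")
  print(entrez)
}
```

Deduplicating the keys before the lookup is a design choice for this sketch; keep the duplicates if the result needs to line up row for row with the original data frame.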

You can read more about this peculiar behavior in Hadley Wickham's book "Advanced R". I'm quoting the relevant paragraph here:

... Unfortunately, most data loading functions in R automatically convert character vectors to factors. This is suboptimal, because there’s no way for those functions to know the set of all possible levels or their optimal order. Instead, use the argument stringsAsFactors = FALSE to suppress this behaviour, and then manually convert character vectors to factors using your knowledge of the data. A global option, options(stringsAsFactors = FALSE), is available to control this behaviour, but I don’t recommend using it. Changing a global option may have unexpected consequences when combined with other code (either from packages, or code that you’re source()ing), and global options make code harder to understand because they increase the number of lines you need to read to understand how a single line of code will behave. ...
