34

I have a data file in the format shown below.
I loaded it into R and tried to plot a histogram of the values in the dist column, but I got the error "x must be numeric". I therefore tried to change the column's type.

> head(data)

    V1        V2
1 type gene_dist
2    A     64667
3    A     76486
4    A     97416
5    A     30876
6    A     88018

> summary(data)
    V1            V2     
 A   : 67   100    :  1  
 B   :122   100906 :  1  
 type:  1   102349 :  1  
            1033   :  1  
            10544  :  1  
            10745  :  1  
            (Other):184  

I tried to set the format for the column using sapply but the values are changed:

> data[,2]<-sapply(data[,2],as.numeric)

> head(data)
    V1  V2
1 type 190
2    A 146
3    A 166
4    A 189

> summary(data)
    V1            V2        
 A   : 67   Min.   :  1.00  
 B   :122   1st Qu.: 48.25  
 type:  1   Median : 95.50  
            Mean   : 95.50  
            3rd Qu.:142.75  
            Max.   :190.00 

Does anyone know why this is happening?

Marek
agatha
  • Can you paste the output of `dput(data)` so that we can reproduce your results? My suspicion is that you are converting a `factor` to `numeric` directly, which is causing the problem. Try replacing it with `function(x) as.numeric(as.character(x))` and see if that works. – Ramnath Jun 13 '11 at 09:47
  • @ Ramnath - problem solved with as.numeric(as.character(x)) – agatha Jun 13 '11 at 09:58
  • It looks like R is classing your columns as factors because you're reading the header as a row entry. Setting `header = T` in your `read.table()` call should fix this. – Richard Herron Jun 13 '11 at 11:28
  • @Richard - I removed the column names from the text file and added them manually, probably not the most elegant way... but it works. colnames(chip_data)<-c("type","gene_dist") – agatha Jun 13 '11 at 12:02

4 Answers

67

It looks like your second column is a factor. You need to use as.character before as.numeric. This is because factors are stored internally as integers with a table to give the factor level labels. Just using as.numeric will only give the internal integer codes. There is no need to use sapply since these functions are vectorized.

data[,2] <- as.numeric(as.character(data[,2]))

It is likely that the column is a factor because there are some non-numeric characters in some of the entries. Any such entries will be converted to NA with the appropriate warning, but you may want to investigate this in your raw data.
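As an illustration of the difference, here is a small sketch using a made-up factor `f` (the values are borrowed from the question; the variable names are only for this example). It shows the internal codes, the recovered values, the `as.numeric(levels(f))[f]` variant mentioned in the comments, and how a stray text entry becomes NA.

```r
# Hypothetical factor holding numbers as labels, as read.table might produce
f <- factor(c("64667", "76486", "97416", "30876", "100"))

codes      <- as.numeric(f)                # internal integer codes, not the labels
values     <- as.numeric(as.character(f))  # the numbers the labels spell out
values_alt <- as.numeric(levels(f))[f]     # same result, converting each level once

print(codes)
print(values)

# A non-numeric label (here the stray header text) becomes NA with a warning;
# the warning is suppressed only to keep the example output clean
g <- factor(c("gene_dist", "64667"))
with_na <- suppressWarnings(as.numeric(as.character(g)))
print(with_na)
```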

As a side note, data is a poor (though not invalid) choice for a variable name since there is a base function of the same name.

James
  • @ James : It worked. Thanks and I will consider your observation. – agatha Jun 13 '11 at 09:55
  • @Andra Now that your question is formatted a little better, I can see that one reason the column is a factor is that the column names are included in the data. You might want to add a `header=TRUE` argument to the command you read the data in with. – James Jun 13 '11 at 11:27
  • @James - I will remember that. I removed the column names from the text file and added them manually, probably not the most elegant way... but it works. colnames(chip_data)<-c("type","gene_dist") – agatha Jun 13 '11 at 12:05
  • Also see http://stackoverflow.com/q/3418128/210673: `as.numeric(levels(f))[f]` is an alternate method that is slightly more efficient. – Aaron left Stack Overflow Jun 13 '11 at 14:48
1

I had the same issue, but the root cause turned out to be different, so I am sharing this as an answer rather than a comment.

df <- read.table("doc.csv", header = TRUE, sep = ",", dec = ".")
df$value

# Results in
[1]  2254    1873    2201    2147    2456    1785

# So..
as.numeric(df$value)
[1] 26 14 22 20 32 11

In my case, the reason was that there were spaces around the values in the original CSV file. Removing the spaces fixed the issue.

From the dput(df)

" 1178  ", " 1222  ", " 1223  ", " 1314  ", " 1462  ", 
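If editing the file is not convenient, one possible approach is to trim the labels in R before converting. This is only a sketch: `padded` is a made-up factor that imitates the padded labels in the dput output above, and `trimws()` needs R 3.2.0 or later.

```r
# Hypothetical factor whose labels carry leading and trailing spaces
padded <- factor(c(" 1178  ", " 1222  ", " 1223  "))

# Go through character first, strip the surrounding whitespace, then convert
cleaned <- as.numeric(trimws(as.character(padded)))

print(cleaned)
```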
0

I had this same issue for a matrix containing 'list' values, when an object data was read in from read.csv. as.character() does not work here, and as.numeric() and data.matrix() changed the values in the matrix. Instead you need to use the following:

matrix_numeric[1:m,1:n] <- as.numeric(as.matrix(data[1:m,1:n]))

This first converts the data to a character matrix and then to double, for a data object with m rows and n columns. You need to create the object matrix_numeric before assigning values to it, for example matrix_numeric <- matrix(0,m,n).

For a vector vec1 in list format, I use the following:

out1 <- as.numeric(unlist(vec1))
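A self-contained sketch of both conversions, using a small made-up data frame and list (the names `df`, `char_matrix` and the sample values are assumptions for illustration only). Here the numeric matrix is built in one step with `matrix()` instead of being pre-allocated.

```r
# Hypothetical data frame whose columns were read in as text
df <- data.frame(a = c("1", "2"), b = c("3.5", "4"), stringsAsFactors = FALSE)

# as.matrix() gives a character matrix; as.numeric() drops the dimensions,
# so they are restored explicitly with matrix()
char_matrix <- as.matrix(df)
matrix_numeric <- matrix(as.numeric(char_matrix),
                         nrow = nrow(char_matrix),
                         dimnames = dimnames(char_matrix))
print(matrix_numeric)

# A list of text values: flatten with unlist(), then convert
vec1 <- list("10", "20", "30")
out1 <- as.numeric(unlist(vec1))
print(out1)
```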

Entropy
0

It's probably much better to fix it when reading the file than by using as.numeric() or as.character(). When reading your file, make sure to have:

  • header=TRUE if first row is header
  • NA and not Na or NaN (ctrl+H and replace by NA in your datafile)
  • no other character strings in your numeric columns

Then R will automatically consider them as numeric.
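A minimal sketch of reading the data this way, with inline text standing in for the file (the sample rows are made up, and `chip_data` is the name used in the comments above). As an alternative to editing the file by hand, the `na.strings` argument of `read.table()` can tell R which spellings to treat as missing; that substitution is my suggestion, not part of the original answer.

```r
# Inline stand-in for the data file; the first row is the header
raw <- "type gene_dist
A 64667
A 76486
B Na
B 30876"

chip_data <- read.table(text = raw,
                        header = TRUE,                # first row holds column names
                        na.strings = c("NA", "Na"),   # spellings to treat as missing
                        stringsAsFactors = FALSE)

str(chip_data)
print(is.numeric(chip_data$gene_dist))
```

With the header consumed as column names and the odd missing-value spelling declared, the column comes in as numeric and `hist(chip_data$gene_dist)` should no longer complain.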

Nakx