
I've imported a test file and tried to make a histogram:

pichman <- read.csv(file="picman.txt", header=TRUE, sep="/t")   
hist <- as.numeric(pichman$WS)    

However, I get numbers that differ from the values in my dataset. Originally I thought this was because the column contained text, so I deleted the text:

table(pichman$WS)    
ws <- pichman$WS[pichman$WS!="Down" & pichman$WS!="NoData"]    

However, I am still getting very high numbers. Does anyone have an idea why?

csgillespie
eliavs
  • See also http://stackoverflow.com/questions/4798343/ and http://stackoverflow.com/questions/3418128 – Aaron left Stack Overflow Feb 08 '11 at 15:15
  • You can use `hablar::retype` after importing the csv file and it will convert all columns to an appropriate data type, i.e. never to factor. So just add `pichman %>% retype`. – davsjob Nov 04 '18 at 11:12

2 Answers


I suspect you are having a problem with factors. For example,

> x = factor(4:8)
> x
[1] 4 5 6 7 8
Levels: 4 5 6 7 8
> as.numeric(x)
[1] 1 2 3 4 5
> as.numeric(as.character(x))
[1] 4 5 6 7 8

Some comments:

  • You mention that your vector contains the characters "Down" and "NoData". What do you expect/want as.numeric to do with these values?
  • In read.csv, try using the argument stringsAsFactors=FALSE
  • Are you sure it's sep="/t" and not sep="\t"?
  • Use the command head(pichman) to check the first few rows of your data
  • Also, it's very tricky to guess what your problem is when you don't provide data. A minimal working example is always preferable. For example, I can't run the command pichman <- read.csv(file="picman.txt", header=TRUE, sep="/t") since I don't have access to the data set.
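
Putting the comments above together, here is a minimal sketch. Since the original file isn't available, the inline data is a made-up stand-in for picman.txt, and handling "Down" and "NoData" through na.strings is one assumed way of dealing with them, not necessarily what the asker wants:

```r
# Hypothetical stand-in for picman.txt: tab-separated, with text markers in WS
txt <- "ID\tWS\n1\t4.5\n2\tDown\n3\t12\n4\tNoData\n5\t7.25"

# Use sep = "\t" (backslash), keep strings as character, and treat the
# text markers as missing values so that WS is read in as numeric
pichman <- read.csv(text = txt, header = TRUE, sep = "\t",
                    stringsAsFactors = FALSE,
                    na.strings = c("NA", "Down", "NoData"))

str(pichman$WS)                       # numeric, with NA where the markers were
ws <- pichman$WS[!is.na(pichman$WS)]  # drop the missing entries
hist(ws)                              # histogram of the actual values
```

With a real file, replace text = txt with file = "picman.txt".
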
csgillespie
  • I added a timing in a new answer. +1 for you as you had it correct and gave all options. – Joris Meys Feb 08 '11 at 10:23
  • Thanks a million! I deleted the values "Down" and "NoData" after I saw that the column did not contain only numbers, and yes, I got my slashes mixed up. – eliavs Feb 08 '11 at 11:08

As csgillespie said, stringsAsFactors defaults to TRUE (in R versions before 4.0.0), which converts any text column to a factor. So even after deleting the text, you still have a factor in your data frame.
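
To illustrate why deleting the text entries is not enough, here is a small sketch. The vector is made up for illustration and is not the asker's data:

```r
# Made-up column, as read.csv would produce it with stringsAsFactors = TRUE
ws_raw <- factor(c("10", "Down", "25", "NoData", "3"))

# Dropping the text entries still leaves a factor with all five levels
ws <- ws_raw[ws_raw != "Down" & ws_raw != "NoData"]
is.factor(ws)                  # TRUE
levels(ws)                     # still includes "Down" and "NoData"

as.numeric(ws)                 # internal level codes, not the data
as.numeric(as.character(ws))   # the actual values: 10 25 3
```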

Now, regarding the conversion, there's a more efficient way to do it, so I put it here as a reference:

> x <- factor(sample(4:8,10,replace=T))
> x
 [1] 6 4 8 6 7 6 8 5 8 4
Levels: 4 5 6 7 8
> as.numeric(levels(x))[x]
 [1] 6 4 8 6 7 6 8 5 8 4

This shows that it works.

The timings:

> x <- factor(sample(4:8,500000,replace=T))
> system.time(as.numeric(as.character(x)))
   user  system elapsed 
   0.11    0.00    0.11 
> system.time(as.numeric(levels(x))[x])
   user  system elapsed 
      0       0       0 

It's a big improvement, although the conversion is not always a bottleneck. It does become important if you have a big data frame and a lot of columns to convert.
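
For the many-columns case, one possible sketch applies the levels-based conversion to every factor column. The helper name factor_to_numeric and the example data frame are illustrative assumptions, not part of the original data:

```r
# Hypothetical helper: convert a factor with numeric-looking labels to numeric
factor_to_numeric <- function(f) as.numeric(levels(f))[f]

# Made-up data frame with two factor columns and one character column
df <- data.frame(a = factor(c("4", "8", "6")),
                 b = factor(c("1.5", "2.5", "1.5")),
                 id = c("x", "y", "z"),
                 stringsAsFactors = FALSE)

# Convert only the factor columns and leave the others untouched
is_fac <- vapply(df, is.factor, logical(1))
df[is_fac] <- lapply(df[is_fac], factor_to_numeric)

str(df)   # a and b are now numeric, id is still character
```

Any factor level that does not look like a number becomes NA (with a warning), so text markers should be dealt with before or after the conversion.
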

Joris Meys