
I have columns of speed measurements that I need to convert to numeric so that I can use the `mean` and `sum` functions. However, when I convert them, the values change substantially.

Why is this?

This is what my data look like at first:

[Image: screenshot of the data before conversion]

And here is the structure of the data frame:

'data.frame':   1899571 obs. of  20 variables:
 $ pcd        : Factor w/ 1736958 levels "AB101AA","AB101AB",..: 1 2 3 4 5 6 6 7 7 8 
 $ pcdstatus  : Factor w/ 5 levels "Insufficient Data",..: 4 4 4 4 4 2 3 2 3 3 ...
 $ mbps2      : Factor w/ 3 levels "N","N/A","Y": 2 2 2 2 2 2 2 2 2 2 ...
 $ averagesp  : Factor w/ 301 levels ">=30","0","0.2",..: 301 301 301 301 301 301 301 
 $ mediansp   : Factor w/ 302 levels ">=30","0","0.1",..: 302 302 302 302 302 302 302 
 $ maxsp      : Factor w/ 301 levels ">=30","0","0.2",..: 301 301 301 301 301 301 301 
 $ nga        : Factor w/ 2 levels "N","Y": 1 2 1 1 1 1 1 2 2 2 ...
 $ connections: Factor w/ 119 levels "<3","0","1","10",..: 2 2 2 2 2 1 2 1 2 2 ...
 $ pcd2       : Factor w/ 1736958 levels "AB10 1AA","AB10 1AB",..: 1 2 3 4 5 6 6 7 7 8 
 $ pcds       : Factor w/ 1736958 levels "AB10 1AA","AB10 1AB",..: 1 2 3 4 5 6 6 7 7 8 
 $ oslaua     : Factor w/ 407 levels "","95A","95B",..: 374 374 374 374 374 374 374 
 $ x          : int  394251 394232 394181 394251 394371 394181 394181 394331 394331 
 $ y          : int  806376 806470 806429 806376 806359 806429 806429 806530 806530 
 $ ctry       : Factor w/ 4 levels "E92000001","N92000002",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ hro2       : Factor w/ 13 levels "","E12000001",..: 12 12 12 12 12 12 12 12 12 12 
 $ soa1       : Factor w/ 34381 levels "","E01000001",..: 32485 32485 32485 32485 
 $ dzone1     : Factor w/ 6507 levels "","E99999999",..: 128 128 128 128 112 128 128 
 $ soa2       : Factor w/ 7197 levels "","E02000001",..: 6784 6784 6784 6784 6784 6784 
 $ urindew    : int  9 9 9 9 9 9 9 9 9 9 ...
 $ soa1ni     : Factor w/ 892 levels "","95AA01S1",..: 892 892 892 892 892 892 892 892 

This is the code for converting my variables to numeric variables.

 #convert individual columns to numeric variables  
 total$averagesp <- as.numeric(total$averagesp) 
 total$mediansp <- as.numeric(total$mediansp) 
 total$maxsp <- as.numeric(total$maxsp) 
 total$mbps2 <- as.numeric(total$mbps2)
 total$nga <- as.numeric(total$nga)
 total$connections <- as.numeric(total$connections)

But I have this strange output afterwards where all my data have been inflated:

[Image: screenshot of the data after conversion]

Any help would be much appreciated - thank you!

Thirst for Knowledge
  • How do you expect R to convert `">=30"`, `"<3"`, `"Y"`, and `"N"` to numbers? – Joshua Ulrich Apr 01 '14 at 15:42
  • True - but I didn't put all of my code into this question to keep it concise. In the actual script I convert all of these characters into pure numerics. Yet, it still inflates all of my data? – Thirst for Knowledge Apr 01 '14 at 15:44
  • It doesn't "inflate". It uses the factor values, not the levels. – Joshua Ulrich Apr 01 '14 at 15:45
  • After removing the symbols and then rerunning the code in a different order, so that the last thing I did was convert the character variable to numeric, I solved the problem. Thanks, Ed – Thirst for Knowledge Apr 01 '14 at 15:52
  • Do not edit your title to indicate "SOLVED". Under normal circumstances, an accepted answer would serve that purpose. In this case, your answer below won't really help anyone, but the pointer to the duplicate will, as that is the actual source of your problem. – joran Apr 01 '14 at 15:58
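The fix described in the comments (strip the symbols, then convert to numeric as the very last step) might look roughly like the sketch below. The sample values are made up to mimic the labels visible in the `str()` output, and treating `">=30"` as 30 and `"<3"` as 3 is an assumption for illustration, not something the thread specifies.

```r
# Made-up values mimicking the censored labels seen in str() (">=30", "<3").
raw <- factor(c(">=30", "0.2", "<3", "12.5"))

# Work on the labels as text, not on the factor's integer codes.
txt <- as.character(raw)

# Strip the comparison symbols. Assumption: ">=30" becomes 30 and "<3" becomes 3.
cleaned <- gsub("[<>=]", "", txt)

# Convert to numeric only after the text has been cleaned.
speed <- as.numeric(cleaned)

# mean() and sum() now operate on the actual measurements.
mean(speed)
sum(speed)
```

Any label that still is not a plain number after cleaning (for example `"N/A"`) would become `NA` with a warning, so it is worth checking `sum(is.na(speed))` afterwards.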

1 Answer


See R FAQ 7.10. Basically, when you use `as.numeric` on a factor, you get the underlying integer codes rather than the numbers shown in the level labels. The FAQ has the recipes for turning them into the numbers represented by the strings.
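As a small illustration of the difference, here is a sketch using a toy factor (the values are invented and stand in for a column such as `total$averagesp`). It shows the integer codes that `as.numeric` returns, followed by the two usual conversion recipes.

```r
# Toy factor standing in for a speed column; the values are made up.
f <- factor(c("7.5", "0.2", "30", "7.5"))

# Levels are sorted as text: "0.2", "30", "7.5".
levels(f)

# as.numeric() on a factor returns the internal integer codes, not the labels.
codes <- as.numeric(f)   # 3 1 2 3

# Recipe 1: convert to character first, then to numeric.
via_character <- as.numeric(as.character(f))

# Recipe 2: convert the levels once, then index them by the factor codes.
via_levels <- as.numeric(levels(f))[f]
```

Both recipes return `7.5 0.2 30 7.5`. The second converts each distinct level only once, which can matter on a data frame with nearly two million rows.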

Greg Snow