Using cbind causes wrong interpretation of numeric variable

Question

When I build the following data.frame:

cntrydata<-as.data.frame(cbind(c('BE', 'BG', 'CH', 'CY', 'CZ', 'DE', 'DK', 'EE', 
             'ES', 'FI', 'FR', 'GB', 'GR', 'HR', 'HU', 'IE', 
             'IL', 'LT', 'NL', 'NO', 'PL', 'PT', 'RU', 'SE', 
             'SI', 'SK', 'UA'),c('C', 'P', 'C', 'P', 'P', 'C', 
             'C', 'C', 'C', 'C', 'C', 'C', 'P', 'P', 'P', 'C',
             'P', 'P', 'C', 'C', 'P', 'C', 'P', 'C', 'P', 'P', 'P'),
              c(7.1, 3.6, 8.7, 6.3, 4.6, 7.9, 9.3, 6.5, 
                6.1, 9.1, 6.8, 7.6, 3.5, 4.1, 4.7, 8, 6.1, 5, 8.8,
                8.6, 5.3, 6, 2.1, 9.2, 6.4, 4.3, 2.4)))
colnames(cntrydata)<-c('cntry','mode','CPI')

The CPI variable is of class factor, but I need it to be numeric for the following call to work:

boxplot(CPI~mode, data=cntrydata)

I tried the following:

as.numeric(levels(cntrydata$CPI))[cntrydata$CPI]

As advised in How to convert a factor to an integer\numeric without a loss of information?

But the column is still of class factor. Any ideas on how to reach my goal?

Also, but less importantly, I was looking for a way to set the column names in the data construction command itself (instead of afterwards, as I eventually did), but I couldn't find how or where to put them.

`class(as.numeric(levels(cntrydata$CPI))[cntrydata$CPI])` returns `numeric` — Matthew Lundberg, Apr 01 '13 at 16:02
`cbind` here is trying to give you everything of the same kind, which isn't what you want - and you don't need it here anyway. Do: `cntrydata<-data.frame(cntry=c('BE', 'BG', 'CH', 'CY', 'CZ', 'DE', 'DK', 'EE', 'ES', 'FI', 'FR', 'GB', 'GR', 'HR', 'HU', 'IE', 'IL', 'LT', 'NL', 'NO', 'PL', 'PT', 'RU', 'SE'),mode=c('C', 'P', 'C', 'P', 'P', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'P', 'P', 'P', 'C', 'P', 'P', 'C', 'C', 'P', 'C', 'P', 'C'), CPI=c(7.1, 3.6, 8.7, 6.3, 4.6, 7.9, 9.3, 6.5, 6.1, 9.1, 6.8, 7.6, 3.5, 4.1, 4.7, 8, 6.1, 5, 8.8, 8.6, 5.3, 6, 2.1, 9.2))` etc — Jonathan Dursi, Apr 01 '13 at 16:04

score 2 · Accepted Answer · answered Apr 01 '13 at 16:03

The following would do the conversion:

cntrydata$CPI <- as.numeric(as.character(cntrydata$CPI))
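Note that the conversion only takes effect once its result is assigned back to the column; evaluating the expression on its own just prints the numbers and leaves the data frame untouched. A minimal sketch, assuming a small hypothetical data frame `df` (not the question's data) and `stringsAsFactors = TRUE` set explicitly so the column starts out as a factor regardless of R version:

```r
# Hypothetical three-row example; stringsAsFactors = TRUE forces a factor column
df <- data.frame(CPI = c("7.1", "3.6", "8.7"), stringsAsFactors = TRUE)

# Evaluating the conversion without assignment does not modify df
as.numeric(as.character(df$CPI))
class(df$CPI)   # "factor"

# Assigning the result back replaces the column
df$CPI <- as.numeric(as.character(df$CPI))
class(df$CPI)   # "numeric"
```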

If you were to construct the data frame as follows, you wouldn't have the issue and you'd also get the column names:

> cntrydata <- data.frame(cntry=c('BE', 'BG', 'CH', 'CY', 'CZ', 'DE', 'DK', 'EE', 
+              'ES', 'FI', 'FR', 'GB', 'GR', 'HR', 'HU', 'IE', 
+              'IL', 'LT', 'NL', 'NO', 'PL', 'PT', 'RU', 'SE', 
+              'SI', 'SK', 'UA'), mode=c('C', 'P', 'C', 'P', 'P', 'C', 
+              'C', 'C', 'C', 'C', 'C', 'C', 'P', 'P', 'P', 'C',
+              'P', 'P', 'C', 'C', 'P', 'C', 'P', 'C', 'P', 'P', 'P'),
+               CPI=c(7.1, 3.6, 8.7, 6.3, 4.6, 7.9, 9.3, 6.5, 
+                 6.1, 9.1, 6.8, 7.6, 3.5, 4.1, 4.7, 8, 6.1, 5, 8.8,
+                 8.6, 5.3, 6, 2.1, 9.2, 6.4, 4.3, 2.4))

score 2 · Answer 2 · answered Apr 01 '13 at 16:06

Your main problem is the way you're creating the data.frame. Do not use cbind and as.data.frame. Try this:

cntrydata <- data.frame( cntry = c('BE', 'BG', 'CH', 'CY', 'CZ', 'DE', 'DK', 'EE', 
         'ES', 'FI', 'FR', 'GB', 'GR', 'HR', 'HU', 'IE', 
         'IL', 'LT', 'NL', 'NO', 'PL', 'PT', 'RU', 'SE', 
         'SI', 'SK', 'UA'), mode = c('C', 'P', 'C', 'P', 'P', 'C', 
         'C', 'C', 'C', 'C', 'C', 'C', 'P', 'P', 'P', 'C',
         'P', 'P', 'C', 'C', 'P', 'C', 'P', 'C', 'P', 'P', 'P'),
          CPI = c(7.1, 3.6, 8.7, 6.3, 4.6, 7.9, 9.3, 6.5, 
            6.1, 9.1, 6.8, 7.6, 3.5, 4.1, 4.7, 8, 6.1, 5, 8.8,
            8.6, 5.3, 6, 2.1, 9.2, 6.4, 4.3, 2.4))

sapply(cntrydata, class)
#     cntry      mode       CPI 
#  "factor"  "factor" "numeric"

This is because cbind only returns a data.frame when at least one of its arguments is a data.frame. Otherwise the result is a matrix, and all elements of a matrix must be of the same type. Since one or more of your columns are of character type, the numeric column is coerced to character as well. as.data.frame then converts those character columns to factors, which is why CPI ends up as a factor.
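The coercion can be seen on a small scale. This is only an illustrative sketch with hypothetical three-element vectors (`cntry`, `cpi`), not the question's full data:

```r
# Hypothetical short vectors standing in for the question's columns
cntry <- c("BE", "BG", "CH")
cpi   <- c(7.1, 3.6, 8.7)

# cbind on plain vectors builds a matrix, so everything becomes character
m <- cbind(cntry, cpi)
is.matrix(m)    # TRUE
typeof(m)       # "character"
m[, "cpi"]      # "7.1" "3.6" "8.7"

# With a data.frame among the arguments, the result is a data.frame
# and the numeric column keeps its type
d <- cbind(data.frame(cntry = cntry), cpi = cpi)
class(d)        # "data.frame"
class(d$cpi)    # "numeric"
```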

score 0 · Answer 3 · answered Apr 01 '13 at 16:04

You need to use as.character() before as.numeric().

The reason is that a factor is stored internally as integer codes, each mapped to a label (its level).
If you simply call as.numeric(someFactor), you get the integer codes.
What you want are the labels, which you can get via as.character.
Since the final result should be numeric, you wrap the two together:

 as.numeric(as.character(someFactor))

Compare:

 > as.numeric(cntrydata$CPI)
  [1] 17  4 22 13  7 19 26 15 12 24 16 18  3  5  8 20 12  9 23 21 10 11  1 25 14  6  2

 > as.numeric(as.character(cntrydata$CPI))
  [1] 7.1 3.6 8.7 6.3 4.6 7.9 9.3 6.5 6.1 9.1 6.8 7.6 3.5 4.1 4.7 8.0 6.1 5.0 8.8 8.6
 [21] 5.3 6.0 2.1 9.2 6.4 4.3 2.4
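The same effect is easier to follow on a tiny factor, where the levels and their codes can be read off directly. A sketch with a hypothetical four-element factor `f`:

```r
# Hypothetical factor whose labels look like numbers
f <- factor(c("7.1", "3.6", "8.7", "3.6"))

levels(f)                     # "3.6" "7.1" "8.7" (sorted labels)
as.integer(f)                 # 2 1 3 1 (position of each label in levels(f))
as.numeric(as.character(f))   # 7.1 3.6 8.7 3.6 (the labels, as numbers)

# Equivalent form from the question: convert the levels once, then index by code
as.numeric(levels(f))[f]      # 7.1 3.6 8.7 3.6
```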
