I'm new to R and Stack Overflow, so my question probably contains a lot of mistakes; sorry in advance.

I'm using R's cor() function, and it took me an hour to work around a small problem, but I still don't understand what's wrong. Basically, I have a data.frame and I want to flag numeric variables that are highly correlated. So I create a subset of the numeric variables, except for SalePrice, which has NAs in the test set:

numericCols <- which(sapply(full[,!(names(full) %in% 'SalePrice')], is.numeric))   

Then

cor(full[,numericCols])    

gives an error:

Error in cor(full[, numericCols]) : 'x' must be numeric.

Except when I do it this way:

numericCols2 <- which(sapply(full, is.numeric))    
numericCols2 <- numericCols2[-31] #dropping SalePrice manually    

it works just fine.

When I do numericCols == numericCols2 the output is:

LotFrontage     LotArea
       TRUE        TRUE
# .
# .   All true
# .
HouseAge     isNew Remodeled BsmtFinSF   PorchSF
   FALSE     FALSE     FALSE     FALSE     FALSE

All the ones that are false are variables I've created myself, for example HouseAge:

full$HouseAge <- full$YrSold - full$YearBuilt    

Why is this happening?

duckmayr
  • Welcome to Stack Overflow! It would be much easier for others to help you if we had access to (at least a subset of) your data. You may want to look at [How to make a great R reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for more information. In the meantime, what happens if you change your first line there to `numericCols <- which(sapply(full[!(names(full) %in% 'SalePrice')], is.numeric))` -- that is, omitting the comma? – duckmayr Jun 22 '19 at 00:30
  • Still the same problem. – NotWarrenBuffett Jun 22 '19 at 00:46
  • I would provide some of your data, such as by editing your question to include the output from `dput(head(full))`; this will make it much easier for others to spot the problem. – duckmayr Jun 22 '19 at 00:55

1 Answer

The error most likely means that a non-numeric column is being passed to cor(): the positions in numericCols are computed on a subset of full, so they do not line up with the columns of full itself. Here is an example that reproduces the problem and explains why one approach raises an error and the other does not.

Let's simulate some data (I use the built-in iris data set from the datasets package and add a non-numeric column "SalePrice"):

data(iris)
full <- cbind(data.frame(SalePrice=rep("NA", nrow(iris))),iris)

If we examine the data.frame full, we see that the "SalePrice" column is not numeric (a factor here; with R >= 4.0 it would be character):

str(full)
# 'data.frame': 150 obs. of  6 variables:
#  $ SalePrice   : Factor w/ 1 level "NA": 1 1 1 1 1 1 1 1 1 1 ...
#  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Now let's examine what happens when you run the following code:

numericCols <- which(sapply(full[,!(names(full) %in% 'SalePrice')], is.numeric))
numericCols
# Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
#            1            2            3            4 

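which() returns the column positions within the subset full[,!(names(full) %in% 'SalePrice')], not within full. As you can see, in my data.frame "SalePrice" is the first column, so once I exclude it and look for the numeric columns in what remains, I get positions 1, 2, 3 and 4 instead of 2, 3, 4 and 5. The same shift explains your numericCols == numericCols2 output: every numeric column located after SalePrice, which includes the variables you added yourself, ends up one position too low.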

When I then call cor() with those positions, I get the error:

cor(full[, numericCols])
#Error in cor(full[, numericCols]) : 'x' must be numeric
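
To see the mismatch directly, one can compare the names that which() attached to the positions with the columns those positions actually pick out of full. This is only a minimal sketch: it rebuilds the example data so it runs on its own, and the names intended, selected and allNumeric are illustrative, not from the question.

```r
# Rebuild the example data: SalePrice is a non-numeric first column
data(iris)
full <- cbind(data.frame(SalePrice = rep("NA", nrow(iris))), iris)

# Positions computed on the subset WITHOUT SalePrice
numericCols <- which(sapply(full[, !(names(full) %in% "SalePrice")], is.numeric))

# Names which() attached to those positions (what we meant to select)
intended <- names(numericCols)

# Columns those positions actually select in the full data.frame
selected <- names(full)[numericCols]

intended
selected

# Are all of the selected columns numeric? cor() needs this to be TRUE
allNumeric <- all(sapply(full[, numericCols], is.numeric))
allNumeric
```

If the two name vectors differ, the positions were computed on a different set of columns than the one they are being applied to.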

Your other approach works because which() is applied to full itself, so it returns the correct column indices:

numericCols2 <- which(sapply(full, is.numeric))  
numericCols2
#Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
#           2            3            4            5  
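
Since positions only make sense relative to the data.frame they were computed on, one possible way to sidestep the issue is to select columns by name instead. The sketch below reuses the example data (rebuilt here so the snippet runs alone); numericNames and corMat are illustrative variable names, and this is just one option, not the only fix.

```r
# Rebuild the example data: SalePrice is a non-numeric first column
data(iris)
full <- cbind(data.frame(SalePrice = rep("NA", nrow(iris))), iris)

# Names of all numeric columns, computed on the full data.frame
numericNames <- names(full)[sapply(full, is.numeric)]

# Drop SalePrice by name; a no-op here because it is not numeric,
# but the same line works when SalePrice is numeric and sits mid-table
numericNames <- setdiff(numericNames, "SalePrice")

# Selecting by name cannot be thrown off by shifted positions
corMat <- cor(full[, numericNames])
dim(corMat)
```

With data like yours, columns such as HouseAge would be picked up by name wherever they sit, so there is no need to hard-code an index like 31.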
Katia