Why does R mix up numerical with categorical variables?

Question

I am confused. I read a .csv file into R and want to fit a linear multivariate regression model. However, R declares all my obviously numeric variables to be factors and my categorical variables to be integers. Therefore, I cannot fit the model.

Does anyone know how to resolve this?

I know this is probably so basic. But I really need to know this. Elsewhere, I found only posts concerning how to declare factors. But this does not apply here.

Any suggestions very much appreciated!

What version of R are you using? Can you give a reproducible example? If it is R 3.1.0, check out: http://stackoverflow.com/questions/22962917/barplot-failure-in-r-3-1-0-read-csv-converting-what-should-be-numerics-to-facto/23248783#23248783 — Andrew Cassidy, Apr 29 '14 at 17:33
Are you reading in the csv with just `temp <- read.csv('file.csv')`? — Stedy, Apr 29 '14 at 17:33
Generally speaking, questions about data read in from a file being misinterpreted are _very_ hard to answer if we can't see the data itself (let alone the commands used to read it in). Usually it means that the data in your file isn't stored in a "standard" way, or at the very least doesn't contain what you think it does. But it's really impossible to say without seeing the file itself. — joran, Apr 29 '14 at 17:38

score 1 · Answer 1 · answered Apr 29 '14 at 17:54

The easiest way, imo, to handle this is to just tell R what type of data your columns contain when you read them into the workspace. For example, if you have a csv file where the first column should be characters, columns 2-21 should be numeric, and column 22 should be a factor, here's how I would read that csv file into the workspace:

 Data <- read.csv("MyData.csv", colClasses=c("character", rep("numeric", 20), "factor"))

shirewoman2

That's a nice approach. Of course, if you don't know the column indices (or want to make a script agnostic to that), there's a downside. – metasoarous Apr 29 '14 at 17:59
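One way around the column-index concern is to name the entries of `colClasses`, which `read.csv` then matches against the header. The following is only a sketch: the inline data and the column names `id`, `height` and `group` are made up for illustration, and `text =` stands in for a real file path.

```r
# Hypothetical data standing in for MyData.csv; all column names are invented.
csv_text <- "id,height,group
a01,1.75,2
a02,1.62,1
a03,1.80,2"

# Named colClasses entries are matched to column names, so column order
# does not matter. Columns not named here are left for read.csv to guess.
Data <- read.csv(text = csv_text,
                 colClasses = c(id = "character", group = "factor"))

str(Data)
```

Here `height` is not listed, so R still guesses its type from the values.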

score 0 · Accepted Answer · answered Apr 29 '14 at 17:52

Sometimes (with certain versions of R, as Andrew points out) float entries in a CSV are long enough that R treats them as strings rather than floats, and so reads the column in as a factor. In this case, you can do the following:

data <- read.csv("filename.csv")
data$some.column <- as.numeric(as.character(data$some.column))

Or you could pass stringsAsFactors=FALSE to the read.csv call and apply as.numeric to the column in the next line. That might be a bad idea, though, if you have a lot of data.
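A minimal sketch of the `stringsAsFactors` route, using made-up inline data instead of `filename.csv`. The stray `n/a` entry is an assumption chosen to force the column to be read as text; it is not taken from the asker's file.

```r
# Hypothetical data: the "n/a" entry keeps some.column from being read as numeric.
csv_text <- "some.column,label
1.5,a
2.25,b
n/a,c"

data <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(data$some.column)   # character, not factor, because of stringsAsFactors = FALSE

# Character -> numeric directly; entries that are not numbers become NA (with a warning).
data$some.column <- as.numeric(data$some.column)
```

With the column read as character, the intermediate `as.character()` step is no longer needed.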

It's a little harder to say what's going on with the categorical variables. You might want to try just treating those as strings and see how that works. A factor is stored internally as integer codes, and R will sometimes treat it as numeric, so this is a good first sanity check. If that doesn't work, you can also see if the regression functions in question will let you declare how the variables should be treated.
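For `lm`, one way to declare how a variable should be treated is to wrap it in `factor()` inside the formula. This is a sketch on simulated data; the names `y`, `x` and `group` are hypothetical.

```r
set.seed(1)

# Simulated data: group is categorical but stored as integer codes 1, 2, 3.
d <- data.frame(
  y     = rnorm(30),
  x     = rnorm(30),
  group = rep(1:3, each = 10)
)

# group used as-is: lm fits a single numeric slope for it.
fit_numeric <- lm(y ~ x + group, data = d)

# group wrapped in factor(): lm fits one coefficient per non-reference category.
fit_factor <- lm(y ~ x + factor(group), data = d)

length(coef(fit_numeric))  # intercept, x, group
length(coef(fit_factor))   # intercept, x, and two group contrasts
```

Comparing the number of coefficients is a quick way to see which interpretation the model used.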

Thanks a lot for your help! I fixed it. As joran mentioned, it is probably hard to respond to that kind of question without seeing the data. But your suggestions, LauraS and metasoarous, work just fine, except for one column. One column header was called "% of x". I use RStudio and R 3.1.0 (thanks to Andrew). After replacing it with "Percentage of x" it worked. And also the other columns can now be read in without problems... Cheers, Paul — user3579082, Apr 29 '14 at 18:27

score 0 · Answer 3 · answered Apr 29 '14 at 18:19

It is hard to tell without a sample of your data file and the commands that you have been using to try and work with the data, but here are some general problems that can lead to what you describe (though there could be other possibilities as well).

The read.csv and read.table (which is called by read.csv) functions will try to guess the type of each column when they are not told what it should be (the colClasses argument). If everything in a column looks like a number then it will be converted to a number, but if R sees anything in the column that does not look like part of a number then it will read the column in as character and convert it to a factor. Common reasons why R sees something non-numeric in a column you think is numeric include: a finger slip results in a letter somewhere in the column; similar-looking substitutions, O for 0 or l for 1; a comma where one is not expected, since many European files use , where R expects . (but there are options to tell R what you want here); or using read.table without setting sep when it really is a comma-separated file.
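To track down this kind of problem, it can help to list the entries that fail to parse as numbers. The sketch below uses made-up inline data with an `O` typed for `0` and an `l` typed for `1`; the second part shows `read.csv2`, which assumes `;` as separator and `,` as decimal mark.

```r
# Hypothetical data with two typos: "2.O" (letter O) and "3l" (letter l).
csv_text <- "x,y
1.5,10
2.O,20
3.1,3l"

raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# For each column, keep the entries that are present but cannot be parsed as numbers.
bad_entries <- lapply(raw, function(col) {
  col <- as.character(col)
  col[!is.na(col) & is.na(suppressWarnings(as.numeric(col)))]
})
bad_entries

# European-style file: semicolon separator, comma as decimal mark.
euro_text <- "a;b
1,5;2
2,25;3"
euro <- read.csv2(text = euro_text)
class(euro$a)
```

The same effect as `read.csv2` can be had by passing `sep = ";"` and `dec = ","` to `read.csv` or `read.table`.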

If you have a categorical variable represented by integers, then R will read it in as integer unless you tell it to make a factor. If you use as.numeric on a factor then it will return the integer codes used to represent the factor internally, not the values shown in its labels. How to convert a factor with labels that are numbers to a numeric is a question (and answer) in the R FAQ.
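A small sketch of both directions, on made-up vectors: turning a factor whose labels are numbers into a numeric vector, and turning integer codes into a factor.

```r
# A factor whose labels happen to be numbers.
f <- factor(c("10", "2.5", "10", "7"))

codes   <- as.numeric(f)                # internal integer codes, not the labels
values  <- as.numeric(as.character(f))  # the numbers shown in the labels
values2 <- as.numeric(levels(f))[f]     # same result, converting each level only once

codes
values

# The other direction: integer codes that really are categories.
group <- c(1L, 2L, 2L, 3L)
group_factor <- factor(group)
levels(group_factor)
```

Going through `levels(f)` converts each distinct label once rather than every element, which can matter for long vectors.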

If this does not point you in the right direction then give us a sample of your data and what commands you are using.
