I have a CSV dataset that has 1000 rows and 21 variables. Of these 21, 9 are categorical variables with more than 2 values. How do I create dummy variables for them in R? I wish to run logistic regression on this data set and interpret the results. I tried using factors and levels to convert them, but I think that works best for variables with only 2 values. I googled quite a bit and found many sites that explain how to do it in theory, but there's no code or function mentioned to help me understand it fully. On this website, I came across the model.matrix() function, the dummies package for R and the dummy.code() function. However, I am still stuck because I am new to R. Sorry for the long question, this is my first time asking here. Thanks in advance!
Possible dup? http://stackoverflow.com/questions/3384506/create-new-dummy-variable-columns-from-categorical-variable – retrography Feb 23 '16 at 23:40
You don't need to make dummy variables. If a variable is a factor, the `glm` procedure will be able to figure out the comparisons for inclusion in a logistic regression model. – thelatemail Feb 23 '16 at 23:43
1 Answer
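In R, most modeling functions will recognize when you are sending categorical values (gender, location, etc.) and will automatically create the dummy variables! For example, if you are doing a linear regression you can just do `lm(y ~ ., data = CSV_DATA)`, where `y` stands for whatever your response column is called; for logistic regression, use `glm()` with `family = binomial` in the same way. If the categorical values are represented by actual numbers, R will treat them as continuous, so first convert them to factors with `factor()` (or to strings) to allow R to adjust accordingly!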
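As a minimal sketch of the automatic route, assuming a made-up data frame: `CHK_ACCT` is the numerically coded variable mentioned in the comments, while `credit`, `DURATION` and `RESPONSE` are hypothetical names and the values are simulated purely for illustration.

```r
# Simulated stand-in for the real CSV; only CHK_ACCT comes from the thread,
# the other names and all values are assumptions for illustration.
set.seed(1)
credit <- data.frame(
  CHK_ACCT = rep(0:3, times = 50),                 # categorical, stored as 0-3
  DURATION = sample(6:48, 200, replace = TRUE),    # a numeric predictor
  RESPONSE = rbinom(200, size = 1, prob = 0.5)     # binary outcome
)

# Tell R the numeric codes are categories, not quantities
credit$CHK_ACCT <- factor(credit$CHK_ACCT)

# glm() expands the factor into dummy variables on its own
fit <- glm(RESPONSE ~ CHK_ACCT + DURATION, data = credit, family = binomial)

# One coefficient per non-reference level: CHK_ACCT1, CHK_ACCT2, CHK_ACCT3
print(names(coef(fit)))
```

With a real file you would read it with `read.csv()` and apply `factor()` to each of the nine categorical columns before fitting.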
If you must do this process manually, you can instead create a loop that iterates through your dataset and populates additional variables. For each categorical variable, you will need n-1 additional variables to represent it numerically, n being the number of possible categories the variable contains. You assign each of your n-1 new variables to one category of the original categorical variable. The last category is represented by 0's in all of your n-1 new variables.

For example, if you are trying to represent location and your data can be "New York", "LA", or "Miami", you would create two (n-1) dummy variables; for ease of explanation we will call them city1 and city2. If the original variable is "New York" you set city1 = 1 and city2 = 0, if it is "LA" you set city1 = 0 and city2 = 1, and if it is "Miami" you set city1 = 0 and city2 = 0.
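The city example can be sketched in a few lines of R. The `location` vector here is made-up sample data; `city1` and `city2` are the names used above. Vectorized comparisons stand in for an explicit loop, which is one possible way to do it, not the only one.

```r
# Made-up sample of the three-category variable
location <- c("New York", "LA", "Miami", "LA", "New York")

# One 0/1 indicator per non-reference category; "Miami" is the reference
city1 <- as.integer(location == "New York")
city2 <- as.integer(location == "LA")

dummies <- data.frame(location, city1, city2)
print(dummies)   # Miami rows have city1 = 0 and city2 = 0
```

The same pattern extends to a variable with n categories: build n-1 indicator columns and leave the reference category as all zeros.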
This works because it does not rank any one of the categories numerically higher than the rest, and it uses the last category as a 'reference' to which all the others are compared! As said previously, if you represent your variables as factors (or strings), R will do this automatically for you, although by default it uses the first level, not the last, as the reference.
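To see the reference-category coding that R builds, `model.matrix()` (mentioned in the question) can be applied to a factor. This is a small sketch on made-up data; putting "Miami" first in `levels` is an assumption made here so that it becomes the reference, matching the example above.

```r
# Made-up data; the level order decides the reference category (the first level)
location <- factor(c("New York", "LA", "Miami", "LA", "New York"),
                   levels = c("Miami", "New York", "LA"))

# Intercept plus n-1 dummy columns
X <- model.matrix(~ location)
print(colnames(X))
print(X)   # the Miami row is 0 in both dummy columns
```

Dropping the intercept column, for example with `X[, -1]`, leaves just the dummy variables, which can then be added to a data frame if an assignment requires explicit dummy columns.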

All the categorical variables are represented by actual numbers. For example, there is one variable called CHK_ACCT which is coded as: 0: < 0; 1: 0 < ... < 200; 2: >= 200; 3: no checking account. I have to run logistic regression after splitting the data into training and validation data sets. – Ishan Goradia Feb 23 '16 at 23:50
I would recommend turning your numbers into strings by simply wrapping them in quotes in the CSV file, then you should be golden! – beeedy Feb 24 '16 at 02:55
But, one of the criteria is that I have to create appropriate dummy variables for all categorical variables with more than 2 values, and then use some of the dummy variables to perform regression. – Ishan Goradia Feb 24 '16 at 03:07
This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post - you can always comment on your own posts, and once you have sufficient [reputation](http://stackoverflow.com/help/whats-reputation) you will be able to [comment on any post](http://stackoverflow.com/help/privileges/comment). - [From Review](/review/low-quality-posts/11381716) – Ronak Shah Feb 24 '16 at 03:57
Not sure what @RonakShah means, but I have edited my answer to include a solution more in line with your assignment's requirements. – beeedy Feb 24 '16 at 04:06