Does anyone know how I can better clean up this data so I can run a logistic regression on it?

I am trying to one-hot encode the variables with multiple categories, like race, workclass, etc. (as shown in the sample dataset below), but I am not sure how to do so.

I was planning to recode income as 1 and 0, since it has only two categories, but I cannot do the same for the rest.

My current plan is to run a logistic regression with all the listed variables:

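# stringsAsFactors = TRUE so that text columns are read as factors (not the default on R >= 4.0.0)
data <- read.csv("adult_income.csv", stringsAsFactors = TRUE)
# read.csv() turns hyphens in column names into dots (e.g. capital-gain -> capital.gain)
mylogit <- glm(formula = income ~ age + workclass + educational.num +
                   marital.status + occupation + race + gender +
                   capital.gain + capital.loss + hours.per.week +
                   native.country, data = data, family = "binomial")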

Sample dataset: 1

I am still fairly new to R, so I apologize for any rookie mistakes!

Roman
peterpra
  • My goal is to create a logit model that reflects the probability of an individual's income to be above or below 50k, given their age, occupation, workclass, etc. – peterpra Oct 22 '18 at 17:40
  • I believe that `model.matrix()` is what you want. I'm just not positive if it uses one-hot encoding for categorical variables. – meh Oct 22 '18 at 18:26
  • Welcome to Stackoverflow, @peterpra! It would be better if you could post your data via `dput(head(df))` into the original question or create mock data. [See here for reference](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – Roman Oct 27 '18 at 10:31
  • You may find the below link useful: [https://stackoverflow.com/questions/11952706/generate-a-dummy-variable](https://stackoverflow.com/questions/11952706/generate-a-dummy-variable) – CrookedNoob Nov 15 '18 at 09:44

3 Answers

2

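R's modelling functions such as glm() encode categorical variables internally when the variable is a factor, for example when you wrap it in as.factor(). By default this is treatment (dummy) coding, where one level serves as the reference and gets no column of its own, rather than a full one-hot encoding. This question was already answered in "Categorical variable in logistic regression in R".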
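A minimal sketch of the point above, using made-up data rather than the asker's file. The data frame `mock`, its columns, and the derived `income_high` column are hypothetical names chosen for illustration. It recodes the two-level income as 0/1, turns a multi-level predictor into a factor, and then inspects the design matrix that glm() built.

```r
# Hypothetical mock data (not the asker's adult_income.csv)
set.seed(42)
n <- 200
mock <- data.frame(
  age    = sample(18:85, n, replace = TRUE),
  race   = sample(c("A", "B", "C"), n, replace = TRUE),
  income = sample(c("<=50K", ">50K"), n, replace = TRUE)
)

# Two-level response recoded as 0/1; multi-level predictor stored as a factor
mock$income_high <- as.integer(mock$income == ">50K")
mock$race <- as.factor(mock$race)

fit <- glm(income_high ~ age + race, data = mock, family = "binomial")

# Columns of the design matrix that glm() created from the formula;
# the first factor level ("A") is the reference and has no column
colnames(model.matrix(fit))
```

So no manual encoding step is needed before fitting; the factor is expanded into indicator columns when the model is built.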

2

With data.table and mltools:

df <- as.data.table(df)
df_oh <- one_hot(df)

Result & Explanation

head(df_oh)
   age education_level marital_status_Divorced marital_status_Married marital_status_Never marital_status_Widowed occupation_Admin occupation_Banking occupation_Farming occupation_Fishing occupation_Poledancing gender_Man gender_Unicorn gender_Woman    hours income_<=50K income_>50K
1:  26              12                       0                      0                    0                      1                0                  0                  0                  0                      1          0              0            1 39.69357            0           1
2:  70              12                       0                      0                    0                      1                0                  0                  0                  0                      1          1              0            0 39.35318            0           1
3:  21              14                       1                      0                    0                      0                1                  0                  0                  0                      0          0              0            1 40.72573            1           0
4:  56               1                       0                      1                    0                      0                0                  1                  0                  0                      0          1              0            0 39.04525            0           1
5:  81               2                       0                      0                    0                      1                0                  0                  1                  0                      0          0              1            0 39.21665            1           0
6:  38               5                       0                      0                    0                      1                1                  0                  0                  0                      0          1              0            0 39.94481            1           0

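one_hot() takes all unordered factor columns of a data table (i.e., not numeric, not character, etc.) and one-hot encodes them. Character columns are left untouched, so convert them to factors first; note that since R 4.0.0, data.frame() and read.csv() no longer turn strings into factors by default. The function needs a data table (and not, say, a data frame), because data tables provide some features/concepts that help with flexibility and speed.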
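A small sketch of the factor requirement, assuming the data.table and mltools packages are installed. The data and the helper variable `char_cols` are hypothetical. The character column is converted to a factor before the table is passed to one_hot(); without that conversion the column would be returned unchanged.

```r
library(data.table)
library(mltools)

# Hypothetical mock data; the strings are kept as character on purpose
df <- data.frame(
  age    = c(26, 70, 21, 56),
  gender = c("Woman", "Man", "Woman", "Unicorn"),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor so one_hot() will encode it
char_cols <- names(df)[sapply(df, is.character)]
df[char_cols] <- lapply(df[char_cols], factor)

# one_hot() expects a data.table
dt_oh <- one_hot(as.data.table(df))
names(dt_oh)
```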

If you check the documentation under ?one_hot, you will see that the function can also handle NAs (if this is a concern in your data).

If you have any questions, please feel free to add a comment.

Reproduction

# Load libraries
library(data.table)
library(mltools)

# Set seed for reproducibility
set.seed(1701)

# Create mock data frame
# (stringsAsFactors = TRUE is needed on R >= 4.0.0 so that one_hot() sees factor columns)
df <- data.frame(
    age = sample(18:85, 50, replace = TRUE),
    education_level = sample(1:15, 50, replace = TRUE),
    marital_status = sample(c("Never", "Married", "Divorced", "Widowed"), 50, replace = TRUE),
    occupation = sample(c("Admin", "Farming", "Poledancing", "Fishing", "Banking"), 50, replace = TRUE),
    gender = sample(c("Man", "Woman", "Unicorn"), 50, replace = TRUE),
    hours = rnorm(50, 40, 1),
    income = sample(c("<=50K", ">50K"), 50, replace = TRUE),
    stringsAsFactors = TRUE)
Resulting in:
> head(df)
  age education_level marital_status  occupation  gender    hours income
1  26              12        Widowed Poledancing   Woman 39.69357   >50K
2  70              12        Widowed Poledancing     Man 39.35318   >50K
3  21              14       Divorced       Admin   Woman 40.72573  <=50K
4  56               1        Married     Banking     Man 39.04525   >50K
5  81               2        Widowed     Farming Unicorn 39.21665  <=50K
6  38               5        Widowed       Admin     Man 39.94481  <=50K
Roman
  • I can't reproduce the result of your example. The categorical columns didn't change after applying the `one_hot` function. Did the function change its behaviour? @Roman – falamiw May 27 '23 at 15:14
1

Install the dummies package and use its dummy() function.

Example:

library(dummies)
# example data 
df1 <- data.frame(id = 1:4, year = 1991:1994)
df1 <- cbind(df1, dummy(df1$year, sep = "_"))

This will generate the dummy variables as below:

df1
#   id year df1_1991 df1_1992 df1_1993 df1_1994
# 1  1 1991        1        0        0        0
# 2  2 1992        0        1        0        0
# 3  3 1993        0        0        1        0
# 4  4 1994        0        0        0        1
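If the dummies package is not available, similar indicator columns can be built with base R. This is only a sketch of one alternative, not the method used in the answer above; the `year_` column prefix and the variable name `indicators` are choices made here for illustration.

```r
# Same example data as above
df1 <- data.frame(id = 1:4, year = 1991:1994)

# One indicator column per level; "- 1" removes the intercept so no level is dropped
indicators <- model.matrix(~ factor(year) - 1, data = df1)

# Replace the default "factor(year)1991"-style names with shorter ones
colnames(indicators) <- paste0("year_", levels(factor(df1$year)))

df1 <- cbind(df1, indicators)
df1
```

For a regression model this step is usually unnecessary, because a factor in the formula is expanded automatically; the explicit columns are mainly useful when a tool expects a purely numeric table.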
CrookedNoob