Dummy or indicator variables are used to include categorical or qualitative variables in a regression model.
Questions tagged [dummy-variable]
868 questions
160
votes
6 answers
How to force R to use a specified factor level as reference in a regression?
How can I tell R to use a certain level as reference if I use binary explanatory variables in a regression?
It's just using some level by default.
lm(x ~ y + as.factor(b))
with b in {0, 1, 2, 3, 4}. Let's say I want to use 3 instead of the zero that…

Matt Bannert
143
votes
5 answers
What are the pros and cons between get_dummies (Pandas) and OneHotEncoder (Scikit-learn)?
I'm learning different methods to convert categorical variables to numeric for machine-learning classifiers. I came across the pd.get_dummies method and sklearn.preprocessing.OneHotEncoder() and I wanted to see how they differed in terms of…

O.rka
66
votes
11 answers
Dummy variables when not all categories are present
I have a set of dataframes where one of the columns contains a categorical variable. I'd like to convert it to several dummy variables, in which case I'd normally use get_dummies.
What happens is that get_dummies looks at the data available in each…

Berne
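
One way to approach the situation described above is to declare the full set of categories up front, so that every dataframe yields the same dummy columns even when a level is absent. This is only a sketch: `ALL_CATEGORIES` and `dummies_with_fixed_levels` are hypothetical names, and it assumes the complete list of levels is known in advance.

```python
import pandas as pd

# Hypothetical full set of levels, assumed to be known ahead of time.
ALL_CATEGORIES = ["a", "b", "c"]

def dummies_with_fixed_levels(values, categories=ALL_CATEGORIES):
    """One-hot encode `values`, emitting one column per declared category."""
    # A categorical dtype with explicit categories keeps unseen levels,
    # so get_dummies still produces an all-zero column for them.
    col = pd.Series(values).astype(pd.CategoricalDtype(categories=categories))
    return pd.get_dummies(col, dtype=int)

# "b" never appears in this chunk, but it still gets a column.
part = dummies_with_fixed_levels(["a", "a", "c"])
```

Applying the same helper to every dataframe in the set should give identically shaped outputs that can be concatenated directly.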
53
votes
5 answers
Pandas: Get Dummies
I have the following dataframe:
amount catcode cid cycle date di feccandid type
0 1000 E1600 N00029285 2014 2014-05-15 D H8TX22107 24K
1 5000 G4600 N00026722 2014 2013-10-22 D H4TX28046 …

Collective Action
52
votes
7 answers
Keep same dummy variable in training and testing data
I am building a prediction model in Python with separate training and testing sets. The training data contains numeric categorical variables, e.g., zip code [91521, 23151, 12355, ...], and also string categorical variables, e.g., city…

nimning
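
A common pattern for this kind of problem is to encode the training set first and then align the test set to the training columns. The sketch below uses made-up data (the `city` values are illustrative, not from the question) and assumes that categories unseen in training can simply be dropped.

```python
import pandas as pd

# Made-up example data; the real column names and values are not shown in the question.
train = pd.DataFrame({"city": ["NY", "LA", "SF"]})
test = pd.DataFrame({"city": ["LA", "Boston"]})

train_d = pd.get_dummies(train, columns=["city"], dtype=int)

# Align the test dummies to the training columns:
# columns missing from test are filled with 0, columns unseen in training are dropped.
test_d = pd.get_dummies(test, columns=["city"], dtype=int).reindex(
    columns=train_d.columns, fill_value=0
)
```

With this approach a test row whose category never occurred in training (here "Boston") ends up as all zeros, which may or may not be acceptable depending on the model.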
23
votes
2 answers
Converting pandas column of comma-separated strings into dummy variables
In my dataframe, I have a categorical variable that I'd like to convert into dummy variables. This column however has multiple values separated by commas:
0 'a'
1 'a,b,c'
2 'a,b,d'
3 'd'
4 'c,d'
Ultimately, I'd want to have binary…

breakbotz
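
For a column like the one shown above, pandas has a string method that splits on a separator and builds indicator columns in one step. A minimal sketch, assuming the values are plain comma-separated strings with no surrounding spaces:

```python
import pandas as pd

# The five example values from the question, as plain strings.
s = pd.Series(["a", "a,b,c", "a,b,d", "d", "c,d"])

# Split each string on "," and emit one 0/1 column per distinct token.
dummies = s.str.get_dummies(sep=",")
```

If the real data has spaces after the commas, the tokens would need to be stripped first, otherwise "b" and " b" become separate columns.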
23
votes
2 answers
How to get pandas get_dummies to emit N-1 variables to avoid collinearity?
pandas.get_dummies emits a dummy variable per categorical value. Is there an automated, easy way to ask it to create only N-1 dummy variables (i.e., drop one arbitrary "baseline" variable)?
This is needed to avoid collinearity in our dataset.

ihadanny
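
In reasonably recent pandas versions, `get_dummies` accepts a `drop_first` argument that does this. A minimal sketch with made-up data (the `color` column is an assumption for illustration):

```python
import pandas as pd

# Made-up example column with three levels.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# drop_first=True removes the first level ("blue", in sorted order),
# leaving N-1 indicator columns; the dropped level becomes the baseline.
encoded = pd.get_dummies(df, columns=["color"], drop_first=True, dtype=int)
```

A row of all zeros then stands for the baseline level, so no information is lost.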
21
votes
1 answer
Creating dummy variables in R data.table
I am working with an extremely large dataset in R. I have been using data frames and have decided to switch to data.tables to speed up operations. I am having trouble understanding the J operations; in particular, I'm trying to…

user2792957
17
votes
1 answer
Handling unknown values for label encoding
How can I handle unknown values for label encoding in scikit-learn?
The label encoder simply raises an exception saying that new labels were detected.
What I want is the encoding of categorical variables via one-hot-encoder. However, scikit-learn does not…

Georg Heiler
16
votes
2 answers
Linear regression with dummy/categorical variables
I have a set of data. I have used pandas to convert the columns into dummy and categorical variables, respectively. Now I want to know how to run a multiple linear regression (I am using statsmodels) in Python. Are there some considerations or maybe I…

Héctor Alonso
15
votes
4 answers
How to summarize data by-group, by creating dummy variables as the collapsing method
I'm trying to summarize a dataset by groups, to have dummy columns for whether each group's values appear among the data's ungrouped most frequent values.
As an example, let's take flights data from nycflights13.
library(dplyr, warn.conflicts =…

Emman
13
votes
1 answer
How to create dummy variable columns for thousands of categories in Google BigQuery?
I have a simple table with 2 columns: UserID and Category, and each UserID can repeat with a few categories, like so:
UserID Category
------ --------
1 A
1 B
2 C
3 A
3 C
3 B
I want to "dummify"…

wubr2000
11
votes
6 answers
Split a string column into several dummy variables
As a relatively inexperienced user of the data.table package in R, I've been trying to process one text column into a large number of indicator columns (dummy variables), with a 1 in each column indicating that a particular sub-string was found…

user2262318
10
votes
3 answers
R: create dummy variables based on a categorical variable *of lists*
I have a data frame with a categorical variable holding lists of strings, with variable length (it is important because otherwise this question would be a duplicate of this or this), e.g.:
df <- data.frame(x = 1:5)
df$y <- list("A", c("A", "B"),…

Giora Simchoni
9
votes
2 answers
Multiple-seasonality time series analysis in Python
I have a daily time series dataset and am using Python's SARIMAX method to forecast it, but I do not know how to write Python code that accounts for multiple seasonalities. As far as I know, SARIMAX handles only one seasonality, but…

Samuel1985