
I need to drop a factor level from a data frame in R. My data has a factor column with 18 levels:

  1. agriculture
  2. fisheries ...
  18. unclassified

I need to remove level #18 before creating dummy variables that say "person X works in industry Y". That is, I need to keep only the first 17 levels (the classified ones).

In Stata, removing the level would be:

drop if rama1 == 99

(rama1 is the factor column and 99 is "unclassified")

Then to create the dummies in Stata (one binary variable per industry) I run:

quietly tabulate rama1, generate(rama1_)

that in R is:

for (i in unique(data$rama1)) {
  data[paste("type", i, sep="")] <- ifelse(data$rama1 == i, 1, 0)
}

Any ideas? Your help is highly welcome.

MichaelChirico
pachadotdev
  • Welcome to SO. First of all, you should read [here](http://stackoverflow.com/help/how-to-ask) about how to ask a good question; a good question has a better chance of being solved and of getting you help. [This](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) is also worth reading: it explains how to create a reproducible example in R. Help users help you by providing a piece of your data, the desired output, and what you have tried so far. – SabDeM Aug 30 '15 at 20:52
  • why not just `data<-data[rama1!="unclassified"]`... – MichaelChirico Aug 30 '15 at 21:10
  • I've tried that but it does nothing :S – pachadotdev Aug 30 '15 at 21:21
  • Removing all elements with a given factor level does not delete the level, you have to use `factor(...)` again to do that. – jlhoward Aug 30 '15 at 21:46
  • How to adapt data<-data[rama1!="unclassified"] so it applies to several levels, not just one? – Adel Aug 09 '21 at 07:54

3 Answers


To remove levels, either approach (BondedDust's or jlhoward's) works fine. How to create the dummy variables depends on what you want and how you want it formulated.

For example, for the removed level, do you want the rows to show up as <NA> or as 0?


Base R

The easiest way to do this is with model.matrix in base R. Building on the example by BondedDust:

df <- data.frame(x=as.factor(sample(LETTERS[1:5],100, replace=TRUE)), y=1:100)

# remove E and the level
is.na(df$x) <- df$x == "E"
df$x <- factor(df$x)

Yields this:

> head(df)
     x y
1    D 1
2    C 2
3    A 3
4 <NA> 4
5    D 5
6    A 6

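Then, we can simply run model.matrix to get the dummy variables for our factor levels. By default it drops the rows where x is NA rather than coding them as 0 (rows 4, 7 and 10 are absent from the output below), and the first level, A, is the baseline absorbed into the intercept.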

> model.matrix(~x, df)
    (Intercept) xB xC xD
1             1  0  0  1
2             1  0  1  0
3             1  0  0  0
5             1  0  0  1
6             1  0  0  0
8             1  1  0  0
9             1  0  0  0
11            1  0  0  0
12            1  0  1  0
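As a rough sketch of the two behaviours above, using base R only: the toy data frame `d` and its values are made up for illustration. It shows that rows with NA are omitted under the default `na.action` option, and that removing the intercept with `- 1` gives one indicator column per remaining level instead of treating the first level as the baseline.

```r
# Hypothetical toy data: "E" plays the role of the unclassified level
d <- data.frame(x = factor(c("A", "B", "E", "C", "A", "D")), y = 1:6)

# Set the unwanted level to NA, then rebuild the factor to drop the unused level
is.na(d$x) <- d$x == "E"
d$x <- factor(d$x)

# Treatment coding: intercept plus one column per level except the first
mm <- model.matrix(~ x, d)

# Full set of dummies: remove the intercept so every level gets its own column
dummies <- model.matrix(~ x - 1, d)

# Row 3 (the NA) is absent from both matrices
rownames(dummies)
colnames(dummies)
```

If the NA rows should be kept, one option is to build the model frame explicitly with `na.action = na.pass`; whether that is preferable depends on how the rows are used afterwards.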

Caret

An alternative is the caret package, which may be more convenient when the same factor encoding has to be applied consistently across training and test/holdout data.

It contains the dummyVars function which does this for you.

> xx <- dummyVars(~x, df)
> predict(xx, df)
    x.A x.B x.C x.D
1     0   0   0   1
2     0   0   1   0
3     1   0   0   0
4    NA  NA  NA  NA
5     0   0   0   1
6     1   0   0   0
7    NA  NA  NA  NA
chappers

R also has a function to "drop" levels, named, unsurprisingly, droplevels. From context, I'm guessing that Stata's drop is more like R's is.na<- in that it appears to set the items to missing within the column. To prevent R from displaying the now 'missing' levels, you would need to first remove the values and then drop the levels.
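A minimal sketch of that two-step sequence, assuming a made-up factor `x` in which "E" stands in for the level to discard:

```r
# Hypothetical toy factor; "E" stands in for the level to discard
x <- factor(c("A", "B", "E", "C", "E", "D"))

# Step 1: mark the unwanted values as missing (the level itself is still listed)
is.na(x) <- x == "E"
levels(x)

# Step 2: drop levels that no longer occur in the data
x <- droplevels(x)
levels(x)
```

`droplevels` also accepts a whole data frame, in which case it drops unused levels from every factor column.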

The creation of multiple columns, one for each "dummy", is completely unnecessary. I suspect it is not needed in Stata, either. I think it's the sort of operation that one might carry over from SAS or SPSS. The regression and table operations in R handle a single factor column appropriately.

df <- data.frame(x=as.factor(sample(LETTERS[1:5],100, replace=TRUE)), y=1:100)
levels(df$x)
#[1] "A" "B" "C" "D" "E"
is.na(df$x) <- df$x == "E"
lm( y~x, df)
#--------------
Call:
lm(formula = y ~ x, data = df)

Coefficients:
(Intercept)           xB           xC           xD  
    49.3846      -0.7846       2.9838       2.7692  

If data$rama1 is numeric, as testing against 99 suggests, then it's not a factor anyway, and discussion of levels is not germane.
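For the numeric case, one possible sketch: the data frame `dat` and its codes are invented for illustration, with 99 standing for "unclassified" as in the Stata line from the question.

```r
# Hypothetical data: rama1 holds numeric industry codes, 99 = unclassified
dat <- data.frame(id = 1:6, rama1 = c(1, 2, 99, 1, 3, 99))

# Rough equivalent of Stata's `drop if rama1 == 99`: keep only the classified rows
dat <- dat[dat$rama1 != 99, ]

# Converting after filtering means the factor never gets a "99" level
dat$rama1 <- factor(dat$rama1)
levels(dat$rama1)
```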

IRTFM
  • Stata's `drop` removes observations or variables from the dataset in memory; it does not set to missing. More interestingly, you are quite right that Stata has machinery that avoids creation of new variables in this case; it is called factor variable notation. – Nick Cox Aug 31 '15 at 07:03

Expanding my comment:

set.seed(1)
df <- data.frame(x=as.factor(sample(LETTERS[1:5],10, replace=TRUE)), y=1:10)
levels(df$x)
# [1] "A" "B" "C" "D" "E"
df <- df[df$x!="E",]        # remove all rows with df$x=="E"
levels(df$x)                # level E remains
# [1] "A" "B" "C" "D" "E"
df$x <- factor(df$x)        # get rid of it...
levels(df$x)
# [1] "A" "B" "C" "D"

Note that as.factor(...) would not have worked: it returns an existing factor unchanged, so the unused level would remain.
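The same pattern extends to several levels at once, as asked in the comments on the question. This is only a sketch with a made-up data frame `df2`; `%in%` replaces the single `!=` comparison.

```r
# Hypothetical toy data frame
df2 <- data.frame(x = factor(c("A", "B", "C", "D", "E", "A")), y = 1:6)

# Remove the rows for several levels at once
df2 <- df2[!(df2$x %in% c("D", "E")), ]

# as.factor() returns an existing factor unchanged, so unused levels survive
kept <- levels(as.factor(df2$x))   # still includes "D" and "E"

# factor() rebuilds the levels from the values actually present
df2$x <- factor(df2$x)
levels(df2$x)
```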

jlhoward