How to collapse categories or recategorize variables?

Question

In R, I have 600,000 categorical variables, each of which is classified as "0", "1", or "2".

What I would like to do is collapse "1" and "2" and leave "0" by itself, such that after re-categorizing "0" = "0"; "1" = "1" and "2" = "1". In the end I only want "0" and "1" as categories for each of the variables.

Also, if possible, I would rather not create 600,000 new variables; if I can replace the existing variables with the new values, that would be great!

What would be the best way to do this?

maja zaloznik · Answer 1 · 2012-01-29T22:48:10.170

I find this is even more generic using factor(new.levels[x]):

> x <- factor(sample(c("0","1","2"), 10, replace=TRUE)) 
> x
 [1] 0 2 2 2 1 2 2 0 2 1
Levels: 0 1 2
> new.levels<-c(0,1,1)
> x <- factor(new.levels[x])
> x
 [1] 0 1 1 1 1 1 1 0 1 1
Levels: 0 1

The new levels vector must be the same length as the number of levels in x, so you can also do more complicated recodes, using strings and NAs for example (shown here applied to the original three-level x):

x <- factor(c("old", "new", NA)[x])
> x
 [1] old    <NA>   <NA>   <NA>   new <NA>   <NA>   old   
 [9] <NA>   new    
Levels: new old
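
The indexing step works because a factor used as a subscript is treated as its integer codes (1 for the first level, 2 for the second, and so on), so new.levels[x] looks up one replacement per element. A minimal sketch with a fixed vector instead of a random sample, so the result is reproducible (the input values are made up for illustration):

```r
# Fixed input instead of sample(), purely for illustration
x <- factor(c("0", "2", "1", "2", "0"))

# One replacement per existing level, in the order given by levels(x)
new.levels <- c(0, 1, 1)

# A factor used as an index is treated as its integer codes
codes <- as.integer(x)          # 1 3 2 3 1

recoded <- factor(new.levels[x])
levels(recoded)                 # "0" "1"
as.character(recoded)           # "0" "1" "1" "1" "0"
```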

John · Answer 2 · 2010-07-16T18:43:16.333

recode()'s a little overkill for this. Your case depends on how it's currently coded. Let's say your variable is x.

If it's numeric:

x <- ifelse(x>1, 1, x)

If it's character:

x <- ifelse(x=='2', '1', x)

If it's a factor with levels 0, 1, 2:

levels(x) <- c(0,1,1)

Any of those can be applied across a data frame dta to the variable x in place. For example...

 dta$x <- ifelse(dta$x > 1, 1, dta$x)

Or, multiple columns of a frame

 df[,c('col1','col2')] <- sapply(df[,c('col1','col2')], FUN = function(x) ifelse(x==0, x, 1))
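
With hundreds of thousands of columns it may be easier to recode every column at once than to name them. A rough sketch, assuming all columns hold numeric 0/1/2 codes (the data frame and its column names are hypothetical). Assigning into dta[] overwrites the existing columns and keeps the object a data frame, so no new variables are created:

```r
# Hypothetical data frame of numeric 0/1/2 codes
dta <- data.frame(col1 = c(0, 1, 2, 2),
                  col2 = c(2, 0, 0, 1),
                  col3 = c(1, 1, 0, 2))

# pmin() caps every value at 1, so 0 stays 0 and both 1 and 2 become 1.
# dta[] <- ... replaces the columns in place and preserves the data.frame class.
dta[] <- lapply(dta, function(x) pmin(x, 1))

# For factor columns the same pattern would be
# (assumption: every column has exactly the levels 0, 1, 2, in that order):
# dta[] <- lapply(dta, function(x) { levels(x) <- c(0, 1, 1); x })
```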

rcs · Accepted Answer · 2010-07-16T18:35:07.223

There is a function recode in package car (Companion to Applied Regression):

require("car")    
recode(x, "c('1','2')='1'; else='0'")

or for your case in plain R:

> x <- factor(sample(c("0","1","2"), 10, replace=TRUE))
> x
 [1] 1 1 1 0 1 0 2 0 1 0
Levels: 0 1 2
> factor(pmin(as.numeric(x), 2), labels=c("0","1"))
 [1] 1 1 1 0 1 0 1 0 1 0
Levels: 0 1

Update: To recode all categorical columns of a data frame tmp you can use the following

recode_fun <- function(x) factor(pmin(as.numeric(x), 2), labels=c("0","1"))
require("plyr")
catcolwise(recode_fun)(tmp)
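
One caveat with the plain R version: factor(..., labels=c("0","1")) infers the levels from the values that actually occur, so a column that happens to contain only "0" yields a single level and the two labels no longer fit. A possible variant, offered as a sketch and not as part of the original answer, names the levels explicitly (recode_fun2 is a hypothetical name; like recode_fun, it assumes the input factor has the levels 0, 1, 2 in that order):

```r
# Variant of recode_fun that fixes the levels explicitly, so it also works
# when a column does not contain every category (recode_fun2 is a made-up name)
recode_fun2 <- function(x) {
  factor(pmin(as.numeric(x), 2), levels = 1:2, labels = c("0", "1"))
}

# A factor with levels 0, 1, 2 in which only "0" actually occurs
only_zero <- factor(c("0", "0", "0"), levels = c("0", "1", "2"))
res <- recode_fun2(only_zero)
levels(res)          # "0" "1"
as.character(res)    # "0" "0" "0"
```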

edited Jul 16 '10 at 18:35

answered Jul 16 '10 at 17:24

Thank you for the response! This is how I am applying it to my data specifically. My data is in the form of a data.frame, which I would like to maintain:

data <- read.table("k.csv", header=TRUE, sep = ",")
dta <- data[,1:30]
col = dim(dta)[2]
for (y in 1:col) {
  py <- factor(pmin(as.data.frame(dta[,y]), 2), labels=c("0","1"))
  py
}

Of course that results in an error - I am sure I am not applying it properly – CCA Jul 16 '10 at 18:21
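
Two things stand out in the loop from this comment: the column is wrapped in as.data.frame() before being passed to factor(), and the result py is never written back into dta. A hedged sketch of a loop that stores each recoded column in place, assuming read.table() delivers the columns as integer 0/1/2 codes (a small invented data frame stands in for k.csv, whose contents are unknown):

```r
# Stand-in for the first columns of k.csv (values here are invented)
dta <- data.frame(v1 = c(0L, 1L, 2L), v2 = c(2L, 2L, 0L))

for (y in seq_along(dta)) {
  # Cap the codes at 1, then store the result back in the same column
  dta[[y]] <- factor(pmin(dta[[y]], 1), levels = 0:1, labels = c("0", "1"))
}

str(dta)
```

Because the columns are numeric here, the cap is 1; the cap of 2 in the answer applies to the integer codes of a factor with levels 0, 1, 2.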

score 2 · Answer 4 · answered Jun 26 '17 at 06:55

I liked the function in dplyr that can quickly recode values.

 library(dplyr)
 df$x <- recode(df$x, old = "new")

Hope this helps :)

Megha John

score 0 · Answer 5 · answered Jun 04 '15 at 14:30

You could use the rec function of the sjmisc package, which can recode a complete data frame at once (provided that all variables share the same recode values).

library(sjmisc)
mydf <- data.frame(a = sample(0:2, 10, TRUE),
                   b = sample(0:2, 10, TRUE),
                   c = sample(0:2, 10, TRUE))

> mydf
   a b c
1  1 1 0
2  1 0 1
3  0 2 0
4  0 1 0
5  1 0 0
6  2 1 1
7  0 1 1
8  2 1 2
9  1 1 2
10 2 0 1

mydf <- rec(mydf, "0=0; 1,2=1")

   a b c
1  1 1 0
2  1 0 1
3  0 1 0
4  0 1 0
5  1 0 0
6  1 1 1
7  0 1 1
8  1 1 1
9  1 1 1
10 1 0 1

score 0 · Answer 6 · answered Nov 05 '21 at 18:15

A solution with the forcats package from the tidyverse:

library(forcats)

> x <- factor(sample(c("0","1","2"), 10, replace=TRUE))
> x
[1] 1 1 1 0 1 0 2 0 1 0
Levels: 0 1 2
    
> fct_collapse(x, "1" = c("1", "2"))
[1] 1 1 1 0 1 0 1 0 1 0
Levels: 0 1

Benoit Lamarsaude

score 0 · Answer 7 · answered Jan 29 '12 at 15:28

Note that if you just want the results to be 0-1 binary variables, you can forgo factors altogether:

f <- sapply(your.data.frame, is.factor)
your.data.frame[f] <- lapply(your.data.frame[f], function(x) x != "0")

The second line can also be written more succinctly (but possibly more cryptically) as

your.data.frame[f] <- lapply(your.data.frame[f], `!=`, "0")

This turns your factors into a series of logical variables, with "0" mapping to FALSE and anything else mapping to TRUE. FALSE and TRUE will be treated as 0 and 1 by most code, which in turn should give essentially the same result in an analysis as using a factor with levels "0" and "1". In fact, if it doesn't give the same result, that would cast doubt on the correctness of the analysis....
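
As a small illustration of the claim that logicals behave like 0 and 1 (the data frame below is made up), arithmetic functions coerce FALSE/TRUE to 0/1 automatically, and as.integer() makes the conversion explicit if some downstream code insists on numbers:

```r
# Made-up data frame with one factor column coded 0/1/2
your.data.frame <- data.frame(g = factor(c("0", "1", "2", "0", "2")))

f <- sapply(your.data.frame, is.factor)
your.data.frame[f] <- lapply(your.data.frame[f], function(x) x != "0")

your.data.frame$g              # FALSE TRUE TRUE FALSE TRUE
sum(your.data.frame$g)         # 3, the number of non-"0" entries
as.integer(your.data.frame$g)  # 0 1 1 0 1
```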
