I have data in Excel sheets and I need a way to clean it. I would like remove inconsistent values, like Branch name is specified as (Computer Science and Engineering, C.S.E, C.S, Computer Science). So how can I bring all of them into single notation?
3 Answers
The car package has a recode
function. See it's help page for worked examples.
In fact an argument could be made that this should be a closed question:
Why is recode in R not changing the original values?
How to recode a variable to numeric in R?
Recode/relevel data.frame factors with different levels
And a few more questions easily identifiable with a search: [r] recode
EDIT: I liked Marek's comment so much I decided to make a function that implemented it. (Factors have always been one of those R-traps for me and his approach seemed very intuitive.) The function is designed to take character or factor class input and return a grouped result that also classifies an "all_others" level.
my_recode <- function(fac, levslist){ nfac <- factor(fac);
inlevs <- levels(nfac);
othrlevs <- inlevs[ !inlevs %in% unlist(levslist) ]
# levslist of the form :::: list(
# animal = c("cow", "pig"),
# bird = c("eagle", "pigeon") )
levels(nfac)<- c(levslist, all_others =othrlevs); nfac}
df <- data.frame(name = c('cow','pig','eagle','pigeon', "zebra"),
stringsAsFactors = FALSE)
df$type <- my_recode(df$name, list(
animal = c("cow", "pig"),
bird = c("eagle", "pigeon") ) )
df
#-----------
name type
1 cow animal
2 pig animal
3 eagle bird
4 pigeon bird
5 zebra all_others
-
I dont want to recode or map to some other value , instead I need to store in a single format, say above eg, single format should be C.S.E. – Sandesh Nagularapu Sep 10 '11 at 21:08
-
@Sandesh Perhaps you haven't explained yourself clearly enough, because based on what you've written, recoding a factor variable's levels is precisely what you've described. – joran Sep 10 '11 at 21:26
-
I agree with @DWIN and @Joran `recode` is the way to go. I had a similar analysis recently with a hospital name in about six different formats, including abbreviations. I recoded all to the singular format I required. – John Sep 11 '11 at 04:59
-
1Another one to collection: [Case statement equivalent in R](http://stackoverflow.com/q/4622060/168747). But I recommend [my way](http://stackoverflow.com/questions/4622060/case-statement-equivalent-in-r/7390653#7390653) over `car:recode`. – Marek Sep 13 '11 at 09:31
-
@Marek. You convinced me. That looks to be the cleanest way. No extra packages. I suppose "car" is a good package but it masks functions I use in "Hmisc", so I try to avoid it if possible. – IRTFM Sep 13 '11 at 13:56
You want a way to clean your data and you specify R. Is there a reason for it? (automation, remote control [console], ...)
If not, I would suggest Open Refine. It is a great tool exactly for this job. It is not hosted, you can safely download it and run against your dataset (xls/xlsx work fine), you then create a text facet and group away.
It uses advanced algorithms (and even gives you a choice) and is really helpful. I have cleaned a lot of data in no time.
The videos at the official web site are useful.

- 211,554
- 25
- 370
- 453

- 502
- 1
- 3
- 12
There are no one size fits all solutions for these types of problems. From what I understand you have Branch Names that are inconsistently labelled.
You would like to see C.S.E. but what you actually have is CS, Computer Science, CSE, etc. And perhaps a number of other Branch Names that are inconsistent.
The first thing I would do is get a unique list of Branch Names in the file. I'll provide an example using letters() so you can see what I mean
your_df <- data.frame(ID=1:2000)
your_df$BranchNames <- sample(letters,2000, replace=T)
your_df$BranchNames <- as.character(your_df$BranchNames) # only if it's a factor
unique.names <- sort(unique(your_df$BranchNames))
Now that we have a sorted list of unique values, we can create a listing of recodes:
Let's say we wanted to rename A through G as just A
your_df$BranchNames[your_df$BranchNames %in% unique.names[1:7]] <- "A"
And you'd repeat the process above eliminating or group the unique names as appropriate.

- 43,807
- 34
- 160
- 255