Data cleaning in Excel sheets using R

Question

I have data in Excel sheets and I need a way to clean it. I would like remove inconsistent values, like Branch name is specified as (Computer Science and Engineering, C.S.E, C.S, Computer Science). So how can I bring all of them into single notation?

score 6 · Answer 1 · edited May 23 '17 at 11:58

6

The car package has a recode function. See it's help page for worked examples.

In fact an argument could be made that this should be a closed question:

Why is recode in R not changing the original values?

How to recode a variable to numeric in R?

Recode/relevel data.frame factors with different levels

And a few more questions easily identifiable with a search: [r] recode

EDIT: I liked Marek's comment so much I decided to make a function that implemented it. (Factors have always been one of those R-traps for me and his approach seemed very intuitive.) The function is designed to take character or factor class input and return a grouped result that also classifies an "all_others" level.

my_recode <- function(fac, levslist){ nfac <- factor(fac);
    inlevs <- levels(nfac);
    othrlevs <- inlevs[ !inlevs %in% unlist(levslist) ]
      # levslist of the form ::::    list(
      #     animal = c("cow", "pig"),
      #     bird = c("eagle", "pigeon") )
 levels(nfac)<- c(levslist, all_others =othrlevs); nfac}

 df <- data.frame(name = c('cow','pig','eagle','pigeon', "zebra"), 
              stringsAsFactors = FALSE)
 df$type <- my_recode(df$name, list(
     animal = c("cow", "pig"),
     bird = c("eagle", "pigeon") ) )
 df
#-----------
    name       type
1    cow     animal
2    pig     animal
3  eagle       bird
4 pigeon       bird
5  zebra all_others

edited May 23 '17 at 11:58

Community

1
1

answered Sep 10 '11 at 20:08

IRTFM

258,963
21
364
487

I dont want to recode or map to some other value , instead I need to store in a single format, say above eg, single format should be C.S.E. – Sandesh Nagularapu Sep 10 '11 at 21:08
@Sandesh Perhaps you haven't explained yourself clearly enough, because based on what you've written, recoding a factor variable's levels is precisely what you've described. – joran Sep 10 '11 at 21:26
I agree with @DWIN and @Joran `recode` is the way to go. I had a similar analysis recently with a hospital name in about six different formats, including abbreviations. I recoded all to the singular format I required. – John Sep 11 '11 at 04:59
1

Another one to collection: [Case statement equivalent in R](http://stackoverflow.com/q/4622060/168747). But I recommend [my way](http://stackoverflow.com/questions/4622060/case-statement-equivalent-in-r/7390653#7390653) over `car:recode`. – Marek Sep 13 '11 at 09:31
@Marek. You convinced me. That looks to be the cleanest way. No extra packages. I suppose "car" is a good package but it masks functions I use in "Hmisc", so I try to avoid it if possible. – IRTFM Sep 13 '11 at 13:56

score 3 · Answer 2 · edited May 25 '18 at 18:51

You want a way to clean your data and you specify R. Is there a reason for it? (automation, remote control [console], ...)

If not, I would suggest Open Refine. It is a great tool exactly for this job. It is not hosted, you can safely download it and run against your dataset (xls/xlsx work fine), you then create a text facet and group away.

It uses advanced algorithms (and even gives you a choice) and is really helpful. I have cleaned a lot of data in no time.

The videos at the official web site are useful.

score 2 · Answer 3 · answered Sep 10 '11 at 22:36

There are no one size fits all solutions for these types of problems. From what I understand you have Branch Names that are inconsistently labelled.

You would like to see C.S.E. but what you actually have is CS, Computer Science, CSE, etc. And perhaps a number of other Branch Names that are inconsistent.

The first thing I would do is get a unique list of Branch Names in the file. I'll provide an example using letters() so you can see what I mean

your_df <- data.frame(ID=1:2000)
your_df$BranchNames <- sample(letters,2000, replace=T)
your_df$BranchNames <- as.character(your_df$BranchNames) # only if it's a factor
unique.names <- sort(unique(your_df$BranchNames))

Now that we have a sorted list of unique values, we can create a listing of recodes:

Let's say we wanted to rename A through G as just A

your_df$BranchNames[your_df$BranchNames %in% unique.names[1:7]] <- "A"

And you'd repeat the process above eliminating or group the unique names as appropriate.

Data cleaning in Excel sheets using R

3 Answers3

Linked