3

I am continuing to work on some data cleaning practice with some animal shelter data. My goal here is to shrink down the number of breed categories.

I am using each breed category as a partial pattern match against the outgoing$Single.Breed data frame column. So, there are cases where the breed will just be Chihuahua, but it may also be Long Hair Chihuahua. (Hence, my use of grepl.) Thus, anything containing a breed category would be represented in a different column by said category. Furthermore, I also need to add the cat breed categories...making for an even messier bunch of code.

The code below is my "solution", but it's quite clunky. Is there a better, slicker and/or more efficient way to accomplish this?

BreedCategories <- ifelse(outgoing$New.Type == "Dog",
           ifelse(grepl("Chihuahua",outgoing$Single.Breed, ignore.case = TRUE), "Chihuahua",
           ifelse(grepl("Pit Bull",outgoing$Single.Breed, ignore.case = TRUE), "Pit Bull",
           ifelse(grepl("Terrier",outgoing$Single.Breed, ignore.case = TRUE), "Terrier",
           ifelse(grepl("Shepherd",outgoing$Single.Breed, ignore.case = TRUE), "Shepherd",
           ifelse(grepl("Poodle",outgoing$Single.Breed, ignore.case = TRUE), "Poodle",
           ifelse(grepl("Labrador|Retriever",outgoing$Single.Breed, ignore.case = TRUE),"Labrador",
           "Other")))))),"Cat")
kimbekaw
  • 55
  • 1
  • 6
  • 1
    You need to include enough data for your question to be [reproducible](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), but check out `dplyr::case_when`. – alistaire Aug 30 '16 at 01:26
  • If all you are trying to do it check if a breed name is listed your don't need the `ifelse`, again a reproducible example plus your desired output would be ideal – Nate Aug 30 '16 at 01:37

1 Answers1

4

Create a data.frame that maps between the regular expression and what the breed is

map <- data.frame(
    pattern=c(
        "Chihuahua", "Pit Bull", "Terrier", "Shepherd",
        "Poodle", "Labrador|Retriever", "Other"),
    isa=c(
        "Chihuahua", "Pit Bull", "Terrier", "Shepherd",
        "Poodle", "Labrador", "Other"),
    stringsAsFactors=FALSE)

and some data

outgoing <- data.frame(Single.Breed=c(map$isa, "Pit Bull Poodle", "Pug"),
                       stringsAsFactors=FALSE)

For the program, use vapply() and grepl() to match each pattern to the data; the use of grepl() means that the result is a matrix, with rows corresponding to each entry

isa <- vapply(map$pattern, grepl, logical(nrow(outgoing)), outgoing$Single.Breed)
if (any(rowSums(isa) > 1))
    warning("ambiguous breeds: ", outgoing$Single.Breed[rowSums(isa) != 1])

Use max.col() to visit each row and retrieve the best (last) match (which happens to be 'Other' if there are no matches).

outgoing$BreedCategory <- map$isa[max.col(isa, "last")]

Here's the result

> isa <- vapply(map$pattern, grepl, logical(nrow(outgoing)), outgoing$Single.Breed)
> if (any(rowSums(isa) > 1))
+     warning("ambiguous breeds: ", outgoing$Single.Breed[rowSums(isa) != 1])
Warning message:
ambiguous breeds: Pit Bull Poodle 
> outgoing$BreedCategory <- map$isa[max.col(isa, "last")]
> outgoing
     Single.Breed BreedCategory
1       Chihuahua     Chihuahua
2        Pit Bull      Pit Bull
3         Terrier       Terrier
4        Shepherd      Shepherd
5          Poodle        Poodle
6        Labrador      Labrador
7           Other         Other
8 Pit Bull Poodle        Poodle
9             Pug         Other

I guess the approach is appealing because it more clearly separates the 'data' (regex and input breeds) from the 'program' (grepl() and max.col()).

The handling of 'Other' seems a little fragile -- what if you forget that it is supposed to be the last element of map? One possibility is to create an indicator variable that tests the row sums of isa, and uses this to conditionally assign the breed

test = rowSums(isa)
outgoing$BreedCategory[test == 0] = "Other"
outgoing$BreedCategory[test == 1] = map$isa[max.col(isa)][test == 1]
outgoing$BreedCategory[test > 1] = "Mixed"

The above is not very space efficient (the matrix transforms your length n data to an n x # of regex matrix), but seems likely to get the job done for say 1M input rows.

dplyr::case_when() seems to require that you write many grepl() statements, which is error prone.

Martin Morgan
  • 45,935
  • 7
  • 84
  • 112
  • I like this - you could also use `map$isa[max.col(isa, "first")]` to avoid the `apply` statement. – thelatemail Aug 30 '16 at 01:52
  • @thelatemail thanks! I'll admit to not know about `max.col()` (odd that there isn't a `max.row()`). – Martin Morgan Aug 30 '16 at 01:59
  • Thank you! This was very helpful and got the job done! This'll definitely be another clever strategy that I will add to my R toolbox for future use! I agree that 'Other' is rather fragile. The indicator variable is a nice way to solve it, disregarding the space inefficiency. Fortunately, my data set isn't massively huge, so that variable is definitely an option if necessary. – kimbekaw Sep 01 '16 at 22:46