1

I have 2 data frames with multiple factor columns. One is the base data frame and the other is the final data frame. I want to update the levels of the base data frame using the final data frame.

Consider this example:

base <- data.frame(product=c("Business Call", "Business Transactional", 
                             "Monthly Non-Compounding and Standard Non-Compounding",
                             "OCR based Call", "Offsale Call", "Offsale Savings",
                             "Offsale Transactional", "Out of Scope","Personal Call"))
base$product <- as.factor(base$product)

final <- data.frame(product=c("Business Call", "Business Transactional", 
                              "Monthly Standard Non-Compounding", "OCR based Call", 
                              "Offsale Call", "Offsale Savings","Offsale Transactional", 
                              "Out of Scope","Personal Call", "You Money")) 
final$product <- as.factor(final$product)

What I would now want is for the final data base to have the same levels as base and remove the levels which do not exist at all like "You Money". Whereas "Monthly Standard Non-Compounding" to be fuzzy matched

Eg:

levels(base$var1) <- "a" "b" "c"
levels(final$var1) <- "Aa" "Bb" "Cc"

Is there a way to overwrite the levels in base data using the final data using some kind of fuzzy match?

Like I want the final levels for both data to be the same. i.e.

levels(base$var1) <- "Aa" "Bb" "Cc"
levels(final$var1) <- "Aa" "Bb" "Cc"
Bruce Wayne
  • 471
  • 5
  • 18

1 Answers1

1

We could build our own fuzzyMatcher.

First, we'll need kinda vectorized agrep function,

agrepv <- function(x, y) all(as.logical(sapply(x, agrep, y)))

on which we build our fuzzyMatcher.

fuzzyMatcher <-  function(from, to) { 
  mc <- mapply(function(y) 
    which(mapply(function(x) agrepv(y, x), Map(levels, to))), 
    Map(levels, from))
  return(Map(function(x, y) `levels<-`(x, y), base, 
             Map(levels, from)[mc]))
}

final labels applied on base labels (note, that I've shifted columns to make it a little more sophisticated):

base[] <- fuzzyMatcher(final1, base1)
#    X1 X2
# 1  Aa Xx
# 2  Aa Xx
# 3  Aa Yy
# 4  Aa Yy
# 5  Bb Yy
# 6  Bb Zz
# 7  Bb Zz
# 8  Aa Xx
# 9  Cc Xx
# 10 Cc Zz

Update

Based on the new provided data above it'll make sense to use another vectorized agrepv2(), which, used with outer(), enables us to apply agrep on all combinations of the levels of both vectors. Hereafter colSums that equal zero give us non-matching levels and which.max the matching levels of the target data frame final. We can use these two resulting vectors on the one hand to delete unused rows of final, on the other hand to subset the desired levels of the base data frame in order to rebuild the factor column.

# add to mimic other columns in data frame
base$x <- seq(nrow(base))
final$x <- seq(nrow(final))

# some abbrevations for convenience
p1 <- levels(base$product)
p2 <- levels(final$product)

# agrep
AGREPV2 <- Vectorize(function(x, y, ...) agrep(p2[x], p1[y]))  # new vectorized agrep 
out <- t(outer(seq(p2), seq(p1), agrepv2, max.distance=0.9))  # apply `agrepv2`
del.col <- grep(0, colSums(apply(out, 2, lengths))) # find negative matches
lvl <- unlist(apply(out, 2, which.max))  # find positive matches
lvl <- as.character(p2[lvl])  # get the labels

# delete "non-existing" rows and re-generate factor with new labels
transform(final[-del.col, ], product=factor(product, labels=lvl))
#                  product x
# 1          Business Call 1
# 2 Business Transactional 2
# 4         OCR based Call 4
# 5           Offsale Call 5
# 6        Offsale Savings 6
# 7  Offsale Transactional 7
# 8           Out of Scope 8
# 9          Personal Call 9

Data

base1 <- structure(list(X1 = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 
3L, 3L), .Label = c("a", "b", "c"), class = "factor"), X2 = structure(c(1L, 
1L, 2L, 2L, 2L, 3L, 3L, 1L, 1L, 3L), .Label = c("x", "y", "z"
), class = "factor")), row.names = c(NA, -10L), class = "data.frame")

final1 <- structure(list(X1 = structure(c(1L, 3L, 1L, 1L, 2L, 3L, 2L, 1L, 
2L, 2L, 3L, 3L, 2L, 2L, 2L), .Label = c("Xx", "Yy", "Zz"), class = "factor"), 
    X2 = structure(c(2L, 1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 2L, 
    2L, 2L, 2L, 2L, 3L), .Label = c("Aa", "Bb", "Cc"), class = "factor")), row.names = c(NA, 
-15L), class = "data.frame")
jay.sf
  • 60,139
  • 8
  • 53
  • 110
  • It doesn't work when the order isn't the same. You think there is a way? Also, when the levels aren't equal – Bruce Wayne Jul 10 '19 at 18:22
  • 1
    @BruceWayne It would be great if you could design a self-contained [minimal reproducible example](https://stackoverflow.com/a/5963610/6574038). – jay.sf Jul 10 '19 at 18:27
  • Sure! Let me try – Bruce Wayne Jul 10 '19 at 19:17
  • I have updated with a new example which is exactly what I would like to replicate – Bruce Wayne Jul 10 '19 at 20:42
  • @BruceWayne Great, see update. I didn't know what you want to do with the "non existing levels" since they do exist. I decided to delete the rows. Another way could be to set them to `NA` or name them `"unmatched"`? – jay.sf Jul 10 '19 at 22:50
  • But you're missing the product "Monthly Standard Non-Compounding" which should occur as a fuzzy match @jay.sf – Bruce Wayne Jul 10 '19 at 22:52
  • @BruceWayne sorry you wrote "remove the levels which do not exist at all like "You Money" and "Monthly Standard Non-Compounding"". I'm sure you're able to fine-tune it to your needs. – jay.sf Jul 10 '19 at 22:54
  • @BruceWayne you're welcome, quite interesting question! – jay.sf Jul 10 '19 at 23:00