
In my specific case, I have two data frames:

> a
         column
1     red apple
2       red car
3  yellow train
4        random
5 random string
6    blue water
7         thing

> map
       x     y
1    red color
2   blue color
3 yellow color
4 random other
5  thing other

The result I want is:

> a
         column
1         color
2         color
3         color
4         other
5         other
6         color
7         other

I know that this post gives an example of using merge: Replacing values in data frame column using another data frame. But my case differs slightly in that:

  1. I want to check whether the key in column x is contained as a substring anywhere in "column" of a.
  2. The map data frame is a many-to-one mapping.

How do I write something that will do this efficiently? The real data frames are bigger: a is about 30k rows and map is 30x2.

EDIT: The substring can start anywhere in "column", not necessarily at the first word.
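For reference, the data frames printed above can be rebuilt with (I use stringsAsFactors=FALSE so the columns stay character):

a <- data.frame(column=c('red apple', 'red car', 'yellow train', 'random', 'random string', 'blue water', 'thing'), stringsAsFactors=FALSE)
map <- data.frame(x=c('red', 'blue', 'yellow', 'random', 'thing'), y=c('color', 'color', 'color', 'other', 'other'), stringsAsFactors=FALSE)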

qwer
2 Answers


Reproducible example:

a <- data.frame(column=c('red apple', 'red car', 'yellow train', 'random', 'random string', 'blue water', 'thing'), stringsAsFactors=F)
map <- data.frame(x=c('red', 'blue', 'yellow', 'random'), y=c('color', 'color', 'color', 'other'))

There are a few options. I could think of two (I'm sure there are more) and I'll show you how to time them to compare their performance. You will probably have to try the timing on your own specific data, as which method is fastest might change depending on e.g. how big map$x is compared to a, or simply the size of a or map.

  • if you know the match (if any) is always on the first word, then you can skip regex and just use strsplit to grab that first word.
  • otherwise, regex can help you here (and there are various ways to do the regex).
  • note pmatch won't really work here, because you are trying to match many longer strings against fewer shorter ones.
  • data.table is the usual go-to for very fast processing of large data. I think the regex may be the limiting factor here though, so I'm not sure you will get much of a speed-up that way (a rough sketch is further down).


# rbenchmark library to compare times
library(rbenchmark)
benchmark(firstword={
  # extract first word; match exactly against the map
  # probably fastest; but "dumbest" unless you know the first word
  #  is always the match
  firstword <- vapply(a$column, function (x) strsplit(x, ' ')[[1]][1], '', USE.NAMES=F) 
  out.firstword <<- map$y[match(firstword, map$x)]
},
regex = {
  # regex option: find the matching word, then use `match`
  #  will have problems if any of map$x has regex special characters.
  regex <- sprintf('^.*\\b(%s)\\b.*$', paste(map$x, collapse='|')) # ^.*\b(red|blue|yellow|random)\b.*$  
  out.regex <<- map$y[match(gsub(regex, '\\1', a$column), map$x)]  # <<- so the result is visible outside benchmark()
},
replications=100)

# check we at least agree on the output and get the expected output
all.equal(out.regex, out.firstword)
all.equal(as.character(out.regex), c('color', 'color', 'color', 'other', 'other', 'color', NA))
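As the comment in the regex branch warns, the pattern will misbehave if any value in map$x contains regex metacharacters. They don't in this example, but if yours do, one option (just a sketch) is to escape them before pasting the pattern together:

# escape regex metacharacters in the keys before building the pattern
escaped <- gsub('([][{}()+*^$|\\\\?.])', '\\\\\\1', as.character(map$x))
regex <- sprintf('^.*\\b(%s)\\b.*$', paste(escaped, collapse='|'))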

Note that if you are benchmarking on your big data, you might want fewer replications! You don't want to sit around waiting for ages... Also note that the last entry is NA rather than "other", because the string "thing" doesn't match anything in the example map above (your real map does have a 'thing' row).

This returns

       test replications elapsed relative user.self sys.self user.child sys.child
1 firstword          100   0.010    1.111         0        0          0         0
2     regex          100   0.009    1.000         0        0          0         0

So for this particular example data, the regex method comes out slightly faster, but as mentioned previously it all depends on your specific data (these example datasets are so small that the two methods take essentially the same time), so your mileage may vary.
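If you want to try the data.table route mentioned in the bullet list above, a rough sketch might look like the following. It reuses the same regex, so I wouldn't expect a big speed-up; the only potential gain is in the lookup/join (and I coerce map's columns to character so the join keys have matching types):

library(data.table)
dt  <- as.data.table(a)
mdt <- as.data.table(map)[, lapply(.SD, as.character)]  # factors -> character
regex <- sprintf('^.*\\b(%s)\\b.*$', paste(mdt$x, collapse='|'))
dt[, x := gsub(regex, '\\1', column)]  # rows with no match keep the whole string
out.dt <- mdt[dt, on='x']$y            # left join; y is NA where nothing matched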

mathematical.coffee
  • '^.*\\b(%s)\\b.*$' means? – qwer Oct 17 '16 at 14:52
  • I get the exact same timings for those, very small, even after switching to microbenchmark, they're essentially the same (a slight edge for `firstword`). I think you need a bigger example, or perhaps these are effectively the same. – Frank Oct 17 '16 at 15:04
  • I've posted what I'm seeing here: http://chat.stackoverflow.com/rooms/25312/r-public You're right that the regex way is faster and that data.table doesn't seem to offer much improvement. – Frank Oct 17 '16 at 15:13
  • @qwer the "\b" means "word boundary" so that (e.g.) "red" doesn't match "fred". The '(red|blue|yellow|random)' means "match red or blue or yellow or random". The '^.*' and '.*$' catch everything else at the start and end of the string, so that the matching string is replaced by just the match itself. – mathematical.coffee Oct 17 '16 at 23:01

EDITED to reflect a comment from mathematical.coffee; EDITED again to switch the order of the arguments to match.

First, you would need to clean the data frame so that its values match the keys in map. So something like a$column1 <- gsub("red.+", "red", a$column), and then rinse and repeat for the rest of the keys. I don't know how to handle the "other" category automatically... that will have to be done manually.

Once you have the cleaned values, match works well here:

The result you want should be given by map$y[match(a$column1, map$x)]: the match(...) part finds the position of each cleaned value in map$x, and map$y[...] returns the corresponding property.
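A loose sketch of the whole approach, looping the gsub over every key in map$x rather than writing one line per key (the exact pattern is my own guess, and there is no word-boundary handling here; see the \b trick in the other answer for that):

# clean: collapse any row containing a key down to just that key
a$column1 <- a$column
for (k in as.character(map$x)) {
  a$column1 <- gsub(paste0('^.*', k, '.*$'), k, a$column1)
}
# then look the cleaned values up in the map
a$result <- map$y[match(a$column1, map$x)]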

Amit Kohli