Reproducible example:
a <- data.frame(column=c('red apple', 'red car', 'yellow train', 'random', 'random string', 'blue water', 'thing'), stringsAsFactors=F)
map <- data.frame(x=c('red', 'blue', 'yellow', 'random'), y=c('color', 'color', 'color', 'other'))
There are a few options. I can think of two (I'm sure there are more), and I'll show you how to time them so you can compare their performance. You will probably have to run the timing on your own specific data, as which method is fastest may change depending on e.g. how big `map$x` is compared to `a`, or simply on the size of `a` or `map`.
- if you know the match (if any) is always on the first word, then you can skip regex and just use `strsplit` to grab that first word.
- otherwise, regex can help you here (and there are various ways to do the regex).
- note that `pmatch` won't really work here, because you are trying to match many longer strings against fewer shorter ones (there's a quick demonstration after this list).
- `data.table` is the usual go-to for very fast processing of large data. I think the regex may be the limiting factor here though, so I'm not sure that you will get any speed-up that way (a minimal sketch follows the list anyway).
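To make the `pmatch` point concrete: `pmatch` treats its first argument as abbreviations of its second, and the long strings in `a$column` are never abbreviations of the short keys in `map$x`, so you get almost nothing back:

pmatch(a$column, map$x)  # only the exact 'random' row finds a match; the rest are NA

And if you do want to try the `data.table` route, here is a minimal sketch of the first-word version written as a join (untimed; `adt`, `mapdt` and `out.dt` are just illustrative names):

library(data.table)
adt <- as.data.table(a)
mapdt <- as.data.table(map)
# grab the first word of each string, then left-join it against the map
adt[, firstword := tstrsplit(column, ' ', keep = 1L)]
out.dt <- mapdt[adt, on = c(x = 'firstword')]$y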
# rbenchmark library to compare times
library(rbenchmark)
benchmark(firstword={
# extract first word; match exactly against the map
# probably fastest; but "dumbest" unless you know the first word
# is always the match
firstword <- vapply(a$column, function (x) strsplit(x, ' ')[[1]][1], '', USE.NAMES=F)
out.firstword <<- map$y[match(firstword, map$x)]  # <<- so the result is visible outside benchmark()
},
regex = {
# regex option: find the matching word, then use `match`
# will have problems if any of map$x has regex special characters (see the note at the end).
regex <- sprintf('^.*\\b(%s)\\b.*$', paste(map$x, collapse='|')) # ^.*\b(red|blue|yellow|random)\b.*$
out.regex <<- map$y[match(gsub(regex, '\\1', a$column), map$x)]  # <<- again, so we can check the output below
},
replications=100)
# check we at least agree on the output and get the expected output
all.equal(out.regex, out.firstword)
all.equal(as.character(out.regex), c('color', 'color', 'color', 'other', 'other', 'color', NA))
Note that if you are benchmarking on your big data, you might want to have fewer replications! You don't want to sit around waiting for years...
Also, note that the last row comes back as `NA`, not "other", because the string "thing" doesn't match anything in your map.
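If you'd rather those non-matches fell back to "other" instead of `NA`, that's a one-liner afterwards (purely illustrative, your question doesn't ask for it):

out.regex[is.na(out.regex)] <- 'other'  # optional: treat anything unmatched as "other"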
The `benchmark()` call returns:
test replications elapsed relative user.self sys.self user.child sys.child
1 firstword 100 0.010 1.111 0 0 0 0
2 regex 100 0.009 1.000 0 0 0 0
So for this particular example data, the regex method comes out slightly faster. But as mentioned earlier, it will all depend on your specific data (this example is so small that both methods are about as fast as each other), so your mileage may vary.
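One last aside on the caveat in the regex comment above: if `map$x` could ever contain regex metacharacters (`.`, `+`, `(` and friends), quote the keys literally before building the pattern. A minimal sketch using PCRE's `\Q...\E` quoting (hence the `perl=TRUE`; `regex.safe` and `out.safe` are just illustrative names):

# wrap each key in \Q...\E so e.g. 'c++' or '3.5' can't break the pattern
regex.safe <- sprintf('^.*\\b(%s)\\b.*$',
                      paste(sprintf('\\Q%s\\E', map$x), collapse='|'))
out.safe <- map$y[match(gsub(regex.safe, '\\1', a$column, perl=TRUE), map$x)]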