
In my specific case, I have two data frames:

> a
         column
1     red apple
2       red car
3  yellow train
4        random
5 random string
6    blue water
7         thing

> map
       x     y
1    red color
2   blue color
3 yellow color
4 random other
5  thing other

The result I want is:

> a
         column
1         color
2         color
3         color
4         other
5         other
6         color
7         other

I know that this post gives an example of using merge: Replacing values in data frame column using another data frame. But my case differs slightly in that:

  1. I want to check whether the key in column x is contained as a substring anywhere in "column" of a.
  2. The map data frame is a many-to-one mapping.

How do I write something that will do this efficiently? The real data frames are bigger: a is about 30k rows and map is 30x2.

EDIT: The substring can start anywhere in "column", not necessarily at the first word.
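For reference, the data frames printed above can be rebuilt with (I use stringsAsFactors=FALSE so the columns stay character):

a <- data.frame(column=c('red apple', 'red car', 'yellow train', 'random', 'random string', 'blue water', 'thing'), stringsAsFactors=FALSE)
map <- data.frame(x=c('red', 'blue', 'yellow', 'random', 'thing'), y=c('color', 'color', 'color', 'other', 'other'), stringsAsFactors=FALSE)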

qwer
2 Answers


Reproducible example:

a <- data.frame(column=c('red apple', 'red car', 'yellow train', 'random', 'random string', 'blue water', 'thing'), stringsAsFactors=F)
map <- data.frame(x=c('red', 'blue', 'yellow', 'random'), y=c('color', 'color', 'color', 'other'))

There are a few options. I could think of two (I'm sure there are more) and I'll show you how to time them to compare their performance. You will probably have to try the timing on your own specific data, as which method is fastest might change depending on e.g. how big map$x is compared to a, or simply the size of a or map.

  • if you know the match (if any) is always on the first word, then you can skip regex and just use strsplit to grab that first word.
  • otherwise, regex can help you here (and there are various ways to do the regex).
  • note pmatch won't really work here, because you are trying to match many longer strings against fewer shorter ones.
  • data.table is the usual go-to for very fast processing of large data. I think the regex may be the limiting factor here though, so I'm not sure you will get much of a speed-up that way (a rough sketch is further down).


# rbenchmark library to compare times
library(rbenchmark)
benchmark(firstword={
  # extract first word; match exactly against the map
  # probably fastest; but "dumbest" unless you know the first word
  #  is always the match
  firstword <- vapply(a$column, function (x) strsplit(x, ' ')[[1]][1], '', USE.NAMES=F) 
  out.firstword <<- map$y[match(firstword, map$x)]
},
regex = {
  # regex option: find the matching word, then use `match`
  #  will have problems if any of map$x has regex special characters.
  regex <- sprintf('^.*\\b(%s)\\b.*$', paste(map$x, collapse='|')) # ^.*\b(red|blue|yellow|random)\b.*$  
  out.regex <<- map$y[match(gsub(regex, '\\1', a$column), map$x)]  # <<- so the result is visible outside benchmark()
},
replications=100)

# check we at least agree on the output and get the expected output
all.equal(out.regex, out.firstword)
all.equal(as.character(out.regex), c('color', 'color', 'color', 'other', 'other', 'color', NA))
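As the comment in the regex branch warns, the pattern will misbehave if any value in map$x contains regex metacharacters. They don't in this example, but if yours do, one option (just a sketch) is to escape them before pasting the pattern together:

# escape regex metacharacters in the keys before building the pattern
escaped <- gsub('([][{}()+*^$|\\\\?.])', '\\\\\\1', as.character(map$x))
regex <- sprintf('^.*\\b(%s)\\b.*$', paste(escaped, collapse='|'))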

Note that if you are benchmarking on your big data, you might want fewer replications! You don't want to sit around waiting for ages... Also note that the last entry is NA rather than "other", because the string "thing" doesn't match anything in the example map above (your real map does have a 'thing' row).

This returns

       test replications elapsed relative user.self sys.self user.child sys.child
1 firstword          100   0.010    1.111         0        0          0         0
2     regex          100   0.009    1.000         0        0          0         0

So for this particular example data, the regex method comes out slightly faster, but as mentioned previously it all depends on your specific data (these example datasets are so small that the two methods take essentially the same time), so your mileage may vary.
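If you want to try the data.table route mentioned in the bullet list above, a rough sketch might look like the following. It reuses the same regex, so I wouldn't expect a big speed-up; the only potential gain is in the lookup/join (and I coerce map's columns to character so the join keys have matching types):

library(data.table)
dt  <- as.data.table(a)
mdt <- as.data.table(map)[, lapply(.SD, as.character)]  # factors -> character
regex <- sprintf('^.*\\b(%s)\\b.*$', paste(mdt$x, collapse='|'))
dt[, x := gsub(regex, '\\1', column)]  # rows with no match keep the whole string
out.dt <- mdt[dt, on='x']$y            # left join; y is NA where nothing matched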

mathematical.coffee
  • '^.*\\b(%s)\\b.*$' means? – qwer Oct 17 '16 at 14:52
  • I get the exact same timings for those, very small, even after switching to microbenchmark, they're essentially the same (a slight edge for `firstword`). I think you need a bigger example, or perhaps these are effectively the same. – Frank Oct 17 '16 at 15:04
  • I've posted what I'm seeing here: http://chat.stackoverflow.com/rooms/25312/r-public You're right that the regex way is faster and that data.table doesn't seem to offer much improvement. – Frank Oct 17 '16 at 15:13
  • @qwer the "\b" means "word boundary" so that (e.g.) "red" doesn't match "fred". The '(red|blue|yellow|random)' means "match red or blue or yellow or random". The '^.*' and '.*$' catch everything else at the start and end of the string, so that the matching string is replaced by just the match itself. – mathematical.coffee Oct 17 '16 at 23:01

EDITED to reflect a comment from mathematical.coffee; EDITED again to switch the order of the arguments to match.

First, you would need to clean the data frame so that its values match the keys in map. So something like a$column1 <- gsub("red.+", "red", a$column), and then rinse and repeat for the rest of the keys. I don't know how to handle the "other" category automatically... that will have to be done manually.

Once you have the cleaned values, match works well here:

The result you want should be given by map$y[match(a$column1, map$x)]: the match(...) part finds the position of each cleaned value in map$x, and map$y[...] returns the corresponding property.
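A loose sketch of the whole approach, looping the gsub over every key in map$x rather than writing one line per key (the exact pattern is my own guess, and there is no word-boundary handling here; see the \b trick in the other answer for that):

# clean: collapse any row containing a key down to just that key
a$column1 <- a$column
for (k in as.character(map$x)) {
  a$column1 <- gsub(paste0('^.*', k, '.*$'), k, a$column1)
}
# then look the cleaned values up in the map
a$result <- map$y[match(a$column1, map$x)]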

Amit Kohli