How to score a data.frame based on another data.frame?

Question

I'm looking for help on how to append scores to a new data set, based on patterns already discovered in a training data set with the same structure. Here is an example of what I am trying to do (taken from another one of my posts):

Here is a sample data set of fake online shopper data, together with the mean of money for each combination of email, browser and country.

library(magrittr)
library(dplyr)

set.seed(123)
dat = data.frame(email=sample(c("yahoo", "gmail"), 10000, replace=T),
                 browser=sample(c("mozilla", "ie"), 10000, replace=T),
                 country=sample(c("usa", "canada"), 10000, replace=T),
                 money=runif(10000))  
dat.withmean <- dat %>%
  group_by(email, browser, country) %>%
  summarize(mean = mean(money))

# email browser country      mean
# 1 gmail      ie  canada 0.5172424
# 2 gmail      ie     usa 0.4921908
# 3 gmail mozilla  canada 0.4934892
# 4 gmail mozilla     usa 0.4993923
# 5 yahoo      ie  canada 0.5013214
# 6 yahoo      ie     usa 0.5098280
# 7 yahoo mozilla  canada 0.4985357
# 8 yahoo mozilla     usa 0.4919743

Now, let's say we have a new data set that looks like this:

newdat = data.frame(email=sample(c("yahoo", "gmail"), 10000, replace=T),
                 browser=sample(c("mozilla", "ie"), 10000, replace=T),
                 country=sample(c("usa", "canada"), 10000, replace=T)) 

head(newdat, n=10)

#   email browser country
#1  gmail      ie     usa
#2  gmail      ie     usa
#3  gmail mozilla  canada
#4  yahoo mozilla  canada
#5  gmail      ie  canada
#6  yahoo mozilla  canada
#7  yahoo mozilla  canada
#8  gmail      ie     usa
#9  yahoo mozilla  canada
#10 gmail mozilla  canada
#... 10,000 rows...

How can I go through newdat, check whether each row's combination of column values matches a row in dat.withmean, and, if it does, append the corresponding value from the "mean" column?

@infominer Thanks that helps me to start looking in the right direction. Still wondering if there is some "tried and true" simple method out there that someone uses (like a function) — Micro, Apr 16 '14 at 16:29
also look at `?merge` and maybe if you put some sample data from both data.frames and show what you tried we could help, See http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — infominer, Apr 16 '14 at 16:31
@infominer merge may be what I want, but can you merge based on 3 columns? — Micro, Apr 16 '14 at 19:00

score 1 · Accepted Answer · answered Apr 16 '14 at 18:46

First compute the group means, as in the question:

dat.withmean <- dat %>%
               group_by(email, browser, country) %>%
               summarize(mean = mean(money))

Now use merge, which "appends" a column called mean to newdat, holding the value for each row's combination of email, browser and country:

newdat.withmean <- merge(newdat, dat.withmean)  # by default, data.frames are merged on the columns they have in common

Read ?merge for more details.
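
As a minimal sketch of the same merge with the key columns spelled out, the example below uses a tiny made-up data set; `train`, `new_rows`, `means` and `scored` are hypothetical names, not from the question. It assumes you want to keep rows of the new data whose combination never occurred in the training data, which `all.x = TRUE` does by filling their mean with NA.

```r
# Tiny made-up training and new data, for illustration only.
train <- data.frame(
  email   = c("gmail", "gmail", "yahoo", "yahoo"),
  browser = c("ie", "ie", "mozilla", "mozilla"),
  country = c("usa", "usa", "canada", "canada"),
  money   = c(0.2, 0.4, 0.6, 1.0),
  stringsAsFactors = FALSE
)
new_rows <- data.frame(
  email   = c("yahoo", "gmail", "gmail"),
  browser = c("mozilla", "ie", "mozilla"),
  country = c("canada", "usa", "usa"),
  stringsAsFactors = FALSE
)

# Group means computed with base R so the sketch has no package dependencies.
means <- aggregate(money ~ email + browser + country, data = train, FUN = mean)
names(means)[names(means) == "money"] <- "mean"

# Name the key columns explicitly instead of relying on the default.
# all.x = TRUE keeps rows of new_rows that have no match in means;
# their mean is NA.
scored <- merge(new_rows, means,
                by = c("email", "browser", "country"),
                all.x = TRUE)
print(scored)
```

Note that merge sorts the result on the key columns by default, so the rows of `scored` are not in the original order of `new_rows`.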

answered Apr 16 '14 at 18:46

infominer


Awesome! Works great. I noticed that it also sorts dat.withmean after the merge. Thanks a lot. – Micro Apr 16 '14 at 19:10
merge is not restricted in the number of columns it can match on, but it only merges two data.frames at a time. If you need to merge multiple data.frames, look at this answer: http://stackoverflow.com/a/22332076/2747709 – infominer Apr 16 '14 at 19:28

score 1 · Answer 2 · answered May 18 '14 at 17:47

You don't even need the temporary variable:

result <-
  dat %>%
  group_by(email, browser, country) %>%
  summarize(mean = mean(money)) %>%
  merge(newdat)

You might also want to use dplyr's *_join family of functions (for example left_join) for speed.
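
A rough sketch of the dplyr join variant, again on a tiny made-up data set (`train`, `new_rows`, `means` and `scored` are hypothetical names). It assumes dplyr 1.0.0 or later for the `.groups` argument of summarize, and uses left_join so that every row of the new data is kept, in its original order, with NA where no matching combination exists.

```r
library(dplyr)

# Tiny made-up training and new data, for illustration only.
train <- data.frame(
  email   = c("gmail", "gmail", "yahoo", "yahoo"),
  browser = c("ie", "ie", "mozilla", "mozilla"),
  country = c("usa", "usa", "canada", "canada"),
  money   = c(0.2, 0.4, 0.6, 1.0),
  stringsAsFactors = FALSE
)
new_rows <- data.frame(
  email   = c("yahoo", "gmail", "gmail"),
  browser = c("mozilla", "ie", "mozilla"),
  country = c("canada", "usa", "usa"),
  stringsAsFactors = FALSE
)

# One mean per combination; .groups = "drop" returns an ungrouped tibble
# (assumes dplyr >= 1.0.0).
means <- train %>%
  group_by(email, browser, country) %>%
  summarize(mean = mean(money), .groups = "drop")

# left_join keeps every row of new_rows in its original order and
# fills mean with NA for combinations absent from means.
scored <- left_join(new_rows, means, by = c("email", "browser", "country"))
print(scored)
```

Unlike merge, left_join does not sort the result, which is convenient when the row order of the new data matters.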

answered May 18 '14 at 17:47

Stefan

