0

I have the following two tables:

Table_1
    ID  Interval
    1   10
    1   11
    2   11

and

Table_2
   ID   Interval    Rating
    1   10          0.5
    1   10          0.3
    1   11          0.1
    2   11          0.1
    2   11          0.2

The output table should look like this:

ID  Interval    Mean Ratings
1   10          0.4
1   11          0.1
2   11          0.15

My goal is to join both tables based on the two conditions/columns ID and interval. Given that I have several ratings for the same ID and interval, I want to compute the mean of the ratings. Whereas the IDs are unique (~9500), the interval repeats for different IDs (as seen in the table above). My current approach is the join function with 2 by arguments. How can I create a final table in which Table_1 and Table_2 are joined based on the condition ID and interval, and receive the average rating in the result column?

left_join(Table_1, Table_2, by = c("ID" = "ID", "Interval" = "Interval"))
smci
  • 32,567
  • 20
  • 113
  • 146
choufrise
  • 87
  • 8
  • 1
    Possible duplicate of [Can dplyr join on multiple columns or composite key?](https://stackoverflow.com/questions/26611717/can-dplyr-join-on-multiple-columns-or-composite-key) – tjebo Jun 08 '18 at 11:43
  • I would downvote but refrain from it now. You have asked a question having detailed the necessary steps for the solution - and a simple google would have helped you. There is heaps about merging/ summarising on SO. – tjebo Jun 08 '18 at 11:45
  • @Tjebo: the OP also needs a groupby-and-summarize step. Join won't compute the mean. – smci Jun 08 '18 at 21:31
  • @smci agree, and therefore it’s not an entire dup of the question but of rather two separate questions which are easy to find. PS if someone downvoted - it wasn’t me. – tjebo Jun 08 '18 at 21:35

3 Answers3

2

First of all you would need to summarize second table DT2 and then perform a right join with first table DT1.

library(data.table)

DT1[DT2[, .(Mean_Rating = mean(Rating)), .(ID, Interval)], on = c(ID = "ID", Interval = "Interval")]

which gives

   ID Interval Mean_Rating
1:  1       10        0.40
2:  1       11        0.10
3:  2       11        0.15


Sample data:

DT1 <- structure(list(ID = c(1L, 1L, 2L), Interval = c(10L, 11L, 11L
)), .Names = c("ID", "Interval"), class = c("data.table", "data.frame"
), row.names = c(NA, -3L))

DT2 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L), Interval = c(10L, 
10L, 11L, 11L, 11L), Rating = c(0.5, 0.3, 0.1, 0.1, 0.2)), .Names = c("ID", 
"Interval", "Rating"), class = c("data.table", "data.frame"), row.names = c(NA, 
-5L))
Prem
  • 11,775
  • 1
  • 19
  • 33
1

You can achieve it with dplyr's left_join, group_by and then summarise.

library(dplyr)

table1 %>% 
  left_join(table2, by = c("ID", "Interval")) %>% 
  group_by(ID, Interval) %>% 
  summarise("Mean Ratings" = mean(Rating))

## A tibble: 3 x 3
## Groups:   ID [?]
#     ID Interval `Mean Ratings`
#  <int>    <int>          <dbl>
#1     1       10           0.4
#2     1       11           0.1
#3     2       11           0.15

data

table1 <- read.table(header = T, text="ID  Interval
    1   10
    1   11
    2   11")

table2 <- read.table(header = T, text = "ID Interval    Rating
        1   10          0.5
        1   10          0.3
        1   11          0.1
        2   11          0.1
        2   11          0.2")
patL
  • 2,259
  • 1
  • 17
  • 38
0

You don't need join. Instead, bind your tables and use group & summarize from dplyr. The following achieves what you asked for:

library(dplyr)

table_1 <- data.frame("ID"= c(1,1,2),"Interval"=c (10,11,11),"Rating"= c(NA,NA,NA))
table_2 <- data.frame("ID"= c(1,1,1,2,2),"Interval"= c(10,10,11,11,11),"Rating"= c(0.5,0.3,0.1,0.1,0.2))

df1 <- bind_rows(table_1,table_2) %>% group_by(ID,Interval) %>% summarise("Mean Ratings" = mean(Rating,na.rm = TRUE))
Stelios K.
  • 313
  • 4
  • 14
  • This is wrong. You need to merge here because If in `Table 2` you have `ID = 3` and `interval = whatever`, then you want to merge so `ID = 3` is not included in the analysis – Sotos Jun 08 '18 at 09:29
  • I can not understand your comment. If a set of IDs exist that are not needed the user could just filter them out e.g. filter(df1, ID %in% c(1,2)). – Stelios K. Jun 08 '18 at 09:53
  • Try it with thousands of IDs... This is why merge exists. – Sotos Jun 08 '18 at 10:15
  • @Sotos: Thanks for your remark. Could you please provide your approach to achieve the desired output? – choufrise Jun 08 '18 at 10:27
  • Any of the other two answers will do. – Sotos Jun 08 '18 at 11:22