dplyr lookup from same table

Question

I have data of how team members rated one another, from multiple teams. Each person has their own id number but also a team and rater number within the team like so:

  StudyID TeamID CATMERater Rated   Rating
    (int)  (int)      (int) (dbl)    (dbl)
1    2930    551          1     1 5.000000 #How rater 1 rated 1 (themselves)
2    2938    551          2     1 3.800000 #How rater 2 rated 1
3    2939    551          3     1 5.000000 #How rater 3 rated 1
4    2930    551          1     2 3.666667 #How rater 1 rated 2
5    2938    551          2     2 4.000000 #...
6    2939    551          3     2 3.866667
...

and so on. I got this format using tidyr and am trying to get a new column of the StudyID where the TeamID and person being rated are the same. This is what I've tried, but doesn't work because I'm not sure how to reference the same table:

edges %>% mutate(RatedStudyID = filter(edges, TeamID == TeamID & Rated == CATMERater))

Hopefully this makes sense, but I'd appreciate a recommendation to get headed in the right direction. If it is something with left_join how do I say where TeamID == TeamID?

Here's what I'd like to see in the end (mostly the last column though):

  StudyID TeamID CATMERater Rated   Rating   RatedStudyID
    (int)  (int)      (int) (dbl)    (dbl)   
1    2930    551          1     1 5.000000   2930
2    2938    551          2     1 3.800000   2930
3    2939    551          3     1 5.000000   2930
4    2930    551          1     2 3.666667   2938
5    2938    551          2     2 4.000000   2938
6    2939    551          3     2 3.866667   2938
...

dput results per @akron that gives an error:

structure(list(StudyID = c(2930L, 2938L, 2939L, 2930L, 2938L, 
2939L, 2930L, 2938L, 2939L, 2930L, 2938L, 2939L, 2930L, 2938L, 
2939L, 2930L, 2938L, 2939L, 2920L, 2941L, 2989L, 2920L, 2941L, 
2989L, 2920L, 2941L, 2989L, 2920L, 2941L, 2989L, 2920L, 2941L, 
2989L, 2920L, 2941L, 2989L, 2922L, 2924L, 2943L, 2922L, 2924L, 
2943L, 2922L, 2924L, 2943L, 2922L, 2924L, 2943L, 2922L, 2924L
), TeamID = c(551L, 551L, 551L, 551L, 551L, 551L, 551L, 551L, 
551L, 551L, 551L, 551L, 551L, 551L, 551L, 551L, 551L, 551L, 552L, 
552L, 552L, 552L, 552L, 552L, 552L, 552L, 552L, 552L, 552L, 552L, 
552L, 552L, 552L, 552L, 552L, 552L, 553L, 553L, 553L, 553L, 553L, 
553L, 553L, 553L, 553L, 553L, 553L, 553L, 553L, 553L), CATMERater = c(1L, 
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 
3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 
2L, 1L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 
2L), Rated = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 
6, 6, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 1, 
1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5), Rating = c(5, 3.8, 5, 
3.66666666666667, 4, 3.86666666666667, 4.53333333333333, 4, 4.8, 
NaN, NaN, NaN, NaN, NaN, NaN, NA, NA, NA, 3.93333333333333, 5, 
5, 5, 5, 5, 5, 5, 5, NaN, NaN, NaN, NaN, NaN, NaN, NA, NA, NA, 
4, 4, 4, 4, 4, 4, 4, 3.86666666666667, 4, NaN, NaN, NaN, NaN, 
NaN)), .Names = c("StudyID", "TeamID", "CATMERater", "Rated", 
"Rating"), class = c("tbl_df", "data.frame"), row.names = c(NA, 
-50L))

Check out [how to create a reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for better ways to share sample data to make it easier to help you. — MrFlick, Apr 27 '16 at 18:14
`edges %>% group_by(Rated, TeamID) %>% mutate(new = StudyID[CATMERater == Rated]) ` ? — jeremycg, Apr 27 '16 at 18:19
@jeremycg I tried this but it gave an error `Error: incompatible size (0), expecting 3 (the group size) or 1` — Andrew Jackson, Apr 27 '16 at 18:45
Is it possible that you have more than one rows that have the same value for each Rated and TeamID — akrun, Apr 27 '16 at 18:50
It looks like you have some where they did not rate themselves - what do you want in that case? — jeremycg, Apr 27 '16 at 18:54
@akrun Each team member rated each other so there are 4 or 5 Rated per TeamID and Rater. — Andrew Jackson, Apr 27 '16 at 19:00
@jeremycg I am going to be dropping those for this case because I'm putting this into a graph visualization. — Andrew Jackson, Apr 27 '16 at 19:01
Please update your post with a small data that gives the error message. The above dataset is not showing any error for me. — akrun, Apr 27 '16 at 19:02
Can you also show the expected output based on the new dataset — akrun, Apr 27 '16 at 19:08
@akrun and @jeremycg with my data I noticed that NA and NaN values might be throwing this off so I used `edges %>% group_by(Rated, TeamID) %>% filter(!is.na(Rating)) %>% mutate(new = StudyID[CATMERater == Rated])` with the added `filter()` and it worked. I'm double checking the results now but it looks good! — Andrew Jackson, Apr 27 '16 at 19:13
Yes, I would use `x %>% group_by(Rated, TeamID) %>% filter(any(CATMERater == Rated)) %>% mutate(new = StudyID[CATMERater == Rated]) `, but same idea — jeremycg, Apr 27 '16 at 19:15
@jeremycg Thanks for the answer. I'll accept it if you add it below. I'm trying to understand the StudyID[...] part. It is looking within the group to find the ID where the rater is the same as the one being rated, right? — Andrew Jackson, Apr 27 '16 at 20:25

score 2 · Accepted Answer · answered Apr 27 '16 at 23:08

From the comments:

library(dplyr)
x %>%
   group_by(Rated, TeamID) %>% #group by each team/rated individual
   filter(any(CATMERater == Rated)) %>% #filter out any groups with unrated individuals
   mutate(new = StudyID[CATMERater == Rated]) #make the new column

The new column is created by subsetting each group - it is the same as x$StudyID[x$CATMERater == x$Rated] would be on the whole dataframe. As long as we have one place where this is true (ie self rating) the value is set to every member of that group.

akrun · Answer 2 · 2016-04-27T19:16:00.020

With data.table

library(data.table)
setDT(edges)[ , RatedStudyID := StudyID[CATMERater == Rated] , .(Rated, TeamID)]
edges
#   StudyID TeamID CATMERater Rated   Rating RatedStudyID
#1:    2930    551          1     1 5.000000         2930
#2:    2938    551          2     1 3.800000         2930
#3:    2939    551          3     1 5.000000         2930
#4:    2930    551          1     2 3.666667         2938
#5:    2938    551          2     2 4.000000         2938
#6:    2939    551          3     2 3.866667         2938

In the new dataset, there are groups that don't have any similar values for CATMERater and Rated in the same row. So, we can use an exception to return NA for those.

setDT(df1)[, RatedStudyID :=if(!any(CATMERater==Rated)) NA_integer_
             else StudyID[CATMERater ==Rated], .(Rated, TeamID)]

score 0 · Answer 3 · answered Apr 27 '16 at 19:29

0

I think you can solve this with a join

edges %>%
  select(TeamID, Rated = CATMERater, RaterStudyID = StudyID) %>%
  inner_join(edges, by = c("TeamID", "Rated"))

answered Apr 27 '16 at 19:29

Thierry

18,049
5
48
66

dplyr lookup from same table

3 Answers3