1

I have a list of baseball players (playerID). I want to find out how many different teams they were on when they played in their All Star games. For example, aaronha01 in the input played for the same team in five All Star appearances, so his NumTms is 1. The output image shows how I would like the output to be formatted.

CODE I tried this code, but it did not work. I would prefer an answer that uses dplyr.

df %>% group_by(playerID) %>% select(playerID, teamID) %>% mutate(teams = n_distinct(playerID)) %>% arrange(desc(teams))

INPUT

structure(list(playerID = c("aaronha01", "aaronha01", "aaronha01", 
"aaronha01", "aaronha01", "aasedo01", "abreubo01", "abreubo01", 
"abreujo02", "abreujo02", "abreujo02", "acunaro01", "adamsac01", 
"adcocjo01", "adcocjo01", "ageeto01", "ageeto01", "aguilje01", 
"aguilri01", "aguilri01", "aguilri01", "aguirha01", "albieoz01", 
"alcansa01", "alexado01", "alfoned01", "allendi01", "allendi01", 
"allendi01", "allendi01", "allendi01", "allendi01", "allendi01"
), teamID = c("MLN", "MLN", "MLN", "MLN", "MLN", "BAL", "PHI", 
"PHI", "CHA", "NYN", "ATL", "ATL", "NY1", "MLN", "DET", "CHA", 
"CHA", "MIL", "MIN", "MIN", "MIN", "DET", "ATL", "MIA", "DET", 
"NYN", "PHI", "PHI", "PHI", "SLN", "CHA", "CHA", "CHA"), lgID = c("NL", 
"NL", "NL", "NL", "NL", "AL", "NL", "NL", "AL", "AL", "AL", "NL", 
"NL", "NL", "NL", "AL", "AL", "NL", "AL", "AL", "AL", "AL", "NL", 
"NL", "AL", "NL", "NL", "NL", "NL", "NL", "AL", "AL", "AL")), row.names = c(NA, 
-33L), class = c("tbl_df", "tbl", "data.frame"))

How I would like the output to display:

enter image description here

I have tried to do it based on the stackoverflow post at How to add count of unique values by group to R data.frame, but was unsuccessful.

Metsfan
  • 510
  • 2
  • 8
  • what you are looking for is `aggregate(NumTms~playerID,cbind(unique(df),NumTms = 1),length)` which cannot be reproduced to the table you want. eg. What if a player was in two leagues? which league should you output? Also which team out of the many teams a player played for should you output? In tidyverse:`distinct(df)%>%group_by(playerID)%>%summarise(NumTms = n())` – Onyambu Jun 14 '20 at 19:01

2 Answers2

1

You were pretty close, you just need to subset the teams to lgID == "AL" before you count them.

df %>%
  group_by(playerID) %>%
  mutate(NumTms = n_distinct(teamID[lgID == "AL"])) %>%
  arrange(desc(NumTms))
# A tibble: 33 x 4
# Groups:   playerID [16]
   playerID  teamID lgID  NumTms
   <chr>     <chr>  <chr>  <int>
 1 abreujo02 CHA    AL         3
 2 abreujo02 NYN    AL         3
 3 abreujo02 ATL    AL         3
 4 aasedo01  BAL    AL         1
 5 ageeto01  CHA    AL         1
 6 ageeto01  CHA    AL         1
 7 aguilri01 MIN    AL         1
 8 aguilri01 MIN    AL         1
 9 aguilri01 MIN    AL         1
10 aguirha01 DET    AL         1
Ian Campbell
  • 23,484
  • 14
  • 36
  • 57
1

Here is an approach that uses dplyr::summarise() to create one row per playerID.

library(dplyr)
df %>% 
     group_by(playerID) %>% filter(lgID == "AL") %>%
     summarise(.,numTeams = n_distinct(teamID))

...and the output:

  playerID  numTeams
  <chr>        <int>
1 aasedo01         1
2 abreujo02        3
3 ageeto01         1
4 aguilri01        1
5 aguirha01        1
6 alexado01        1
7 allendi01        1

If we add lgID to the group_by(), we can count the number of teams within each league played by playerID.

df %>% 
     group_by(playerID,lgID) %>% 
     summarise(.,numTeams = n_distinct(teamID))

...and the output:

   playerID  lgID  numTeams
   <chr>     <chr>    <int>
 1 aaronha01 NL           1
 2 aasedo01  AL           1
 3 abreubo01 NL           1
 4 abreujo02 AL           3
 5 acunaro01 NL           1
 6 adamsac01 NL           1
 7 adcocjo01 NL           2
 8 ageeto01  AL           1
 9 aguilje01 NL           1
10 aguilri01 AL           1
11 aguirha01 AL           1
12 albieoz01 NL           1
13 alcansa01 NL           1
14 alexado01 AL           1
15 alfoned01 NL           1
16 allendi01 AL           1
17 allendi01 NL           2
Len Greski
  • 10,505
  • 2
  • 22
  • 33
  • What purpose does the dot serve at the start of the Summarise function? – Metsfan Jun 16 '20 at 10:46
  • @Metsfan - The dot notation references the data frame in the pipeline from the earlier step. For details see the [help for summarise()](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/summarise). – Len Greski Jun 16 '20 at 10:55