1

(Other questions & answers elsewhere on this forum don't seem to deal with the cross-border issue mentioned in this feed)

Suppose I have the following data:

df <- data.frame(id=c("Eric", "John", "Sarah", "Simon", "Abdul", "Charlotte", "Alex", "Susan"),
             state=c("CA", "AK", "NY", "NY", "NJ", "GA", "CA", "CA"),
             project=c(1, 2, 2, 2, 3, 4, 5, 5), stringsAsFactors = F)

> df
         id state project
1      Eric    CA       1
2      John    AK       2
3     Sarah    NY       2
4     Simon    NY       2
5     Abdul    NJ       3
6 Charlotte    GA       4
7      Alex    CA       5
8     Susan    CA       5

I want to get the average number of project members per state, also counting cross-border members.

To get the average of only in-state members, I did the following:

dfx <- data.frame()
dfy <- data.frame()
for(j in unique(df$state)){
   h <- subset(df, state==j)
   counts <- plyr::count(h, 'project')
   #uniques <- length(unique(sub$invje))
   average_members <- mean(counts$freq)
   dfx <- data.frame(state=j,
                average_members=average_members)
   dfy <- rbind(dfy, dfx)
} 

> dfy
  state average_members
1    CA             1.5
2    AK             1.0
3    NY             2.0
4    NJ             1.0
5    GA             1.0

The desired output I'm after, both AK and NY should score 3 because each ID works with two other IDs on a project (despite living in a different state).

> desired
  state average_members
1    CA             1.5
2    AK             3.0
3    NY             3.0
4    NJ             1.0
5    GA             1.0

Does anyone know how to code that?

wake_wake
  • 1,332
  • 2
  • 19
  • 46
  • Can you post your desire output ? also where `counts$freq` come from ? – BENY Jul 10 '17 at 19:27
  • @Wen `counts$freq` comes from `counts` generate by `plyr::count(h, 'project')` – wake_wake Jul 10 '17 at 19:32
  • Possible duplicate of [Aggregate / summarize multiple variables per group (i.e. sum, mean, etc)](https://stackoverflow.com/questions/9723208/aggregate-summarize-multiple-variables-per-group-i-e-sum-mean-etc) – Kristofersen Jul 10 '17 at 20:52

3 Answers3

3
library(data.table)

setDT(df)
df[, .(num_proj = .N), by = .(state, project)][, .(average_members = mean(num_proj)), by = state]

Result:

   state average_members
1:    CA             1.5
2:    AK             1.0
3:    NY             2.0
4:    NJ             1.0
5:    GA             1.0

For the second case, pull the state out of the group by in the first iteration.

unique(df[, .(state, num_proj = .N), by = project])[, .(average_members = mean(num_proj)), by = state]

1:    CA             1.5
2:    AK             3.0
3:    NY             3.0
4:    NJ             1.0
5:    GA             1.0
Eric Watt
  • 3,180
  • 9
  • 21
2

You can do this with the dplyr library. You can answer your within state only question with

library(dplyr)
df %>% count(state, project) %>% 
    group_by(state) %>% summarize(avg=mean(n))
#   state       avg
# 1    AK       1.0
# 2    CA       1.5
# 3    GA       1.0
# 4    NJ       1.0
# 5    NY       2.0

And you can get your across-state result with

df %>% distinct(project, state) %>% 
    inner_join(df %>% count(project)) %>% 
    group_by(state) %>% summarize(avg=mean(n))

#   state       avg
# 1    AK       3.0
# 2    CA       1.5
# 3    GA       1.0
# 4    NJ       1.0
# 5    NY       3.0
MrFlick
  • 195,160
  • 17
  • 277
  • 295
1
df <- data.frame(id=c("Eric", "John", "Sarah", "Simon", "Abdul", "Charlotte", "Alex", "Susan"),
                 state=c("CA", "AK", "NY", "NY", "NJ", "GA", "CA", "CA"),
                 project=c(1, 2, 2, 2, 3, 4, 5, 5), stringsAsFactors = F)


dfx <- data.frame()
dfy <- data.frame()

for (j in unique(df$state)) {
    h = subset(df, state==j)
    thisStatesProjects = unique(h[,"project"])
    h2 = subset(df, project %in% thisStatesProjects)
    average_members = nrow(h2)/length(thisStatesProjects)
    dfx <- data.frame(state=j,
                      average_members=average_members)
    dfy <- rbind(dfy, dfx)

}
Patrick
  • 513
  • 4
  • 14