2

I want to write a loop that can aggregate the number of instances (of certain values) that are grouped by year. More specifically, say the variable is x1. I want to have two groups, one is when x1 = 1, and the other when it is a combination of some values (2,3, and 5 in the below example):

year    x1
2000    1
2000    1
2000    2
2000    3
2000    5

The end result should look like this:

year    x2    x3
2000    2     3

where x2 and x3 are the counts when x1 = 1 and x1 = c(2,3,5), respectively. How can one accomplish this?

Edit: Probably should have mentioned this earlier. I work with two datasets; one df1 is yearly (spanning approx. 200 years) and the other df2 is incident-based (around 50k observations; this is where x1 is currently located). So the idea of the loop is to look at each year[i] in df2 and aggregate the counts by grouping them as x2 and x3 in df1.

Edit2: Ah, I solved why the submitted answers were not working for me. Apparently I ran into the dplyr before plyr problem discussed in this answer; I followed ManneR's answer and detached plyr. Now the group_by command works again.

Community
  • 1
  • 1
  • What distincts one group from another? – David Arenburg Dec 25 '16 at 20:48
  • There are three variables measuring the same phenomenon, and they are all numerical. However, I only want to distinguish certain categories. If it helps, the first group (x2) is always a scalar, whereas the second group (x3) is always an aggregated count of several categories. –  Dec 25 '16 at 20:52

3 Answers3

0

I am not sure what was wrong with user3349904's answer as it seems to do what you are asking. Its not easy to know exactly what you are asking for without knowing what your data looks like. If your issue with the other solution due to the fact that df1 needs to hold the x2 and x3 values? The last part will solve for that.

I tried to replicate your problem from scratch so here's my shot at a solution.

library(dplyr)

#create DF1 (years)
df1 <- as.data.frame(matrix(ncol=3,nrow = 200))
df1$V1 <- c(1800:1999)
colnames(df1) <- c("year","x2","x3")

#create DF2 (transactions)
df2 <- as.data.frame(matrix(ncol=2,nrow=50000))
#add random sample data
df2$V1 <- sample(1800:1999,50000,replace = T)
df2$V2 <- sample(1:5,5000,replace = T)
colnames(df2) <- c("year","x1")

# group by year in df2 and aggregate counts based on categories
df2 %>% group_by(year) %>%
   summarise(x2 = sum(x1==1), x3 = sum(x1 %in% c(2,3,5))) -> df3

# match years in df3 and df1 and bring lookup value to df1
df1$x2 <- df3$x2[match(df1$year,df3$year)]
df1$x3 <- df3$x3[match(df1$year,df3$year)]
Collier
  • 56
  • 3
  • I made another edit to the original question, the `group_by` command was not working, so I kept getting a `df3` that consists of a single observation. Your code now does what I want after fixing the package loading sequence. –  Dec 26 '16 at 11:59
0

Here is another option using dplyr/tidyr

library(dplyr)
library(tidyr)
df1 %>%
    group_by(year, grp = paste0("x", (x1 != 1) + 2)) %>%
    summarise(x1= n()) %>% 
    spread(grp, x1)
#   year    x2    x3
#* <int> <int> <int>
#1  2000     2     3

Or using base R

xtabs(Freq~year + x1, transform(df1, x1= paste0("x", (x1!=1)+2), Freq= 1))
akrun
  • 874,273
  • 37
  • 540
  • 662
-1

Assuming you are starting from a data frame called df, this will count the cases as you describe them by year:

library(dplyr)
df %>% group_by(year) %>% summarise(x2 = sum(x1==1), x3 = sum(x1 %in% c(2,3,5)))
  • Thanks for the input, I clarified what I need by editing the question. Can you make it work so that it is applicable to my problem? –  Dec 25 '16 at 21:25