0

Uing R, I want to count the number of occurences in two variables by two other variables; IDS and year. One of the variables counted need to be counted by unique value. I have really looked around for an answer to this but I cannot seem to find it. I have a dataset like this (though including many more variables):

IDS = c(1,1,1,1,1,1,2,2) 
year = c(1,1,1,1,1,2,1,1) 
x = c(5, 5, 5, 10, 2, NA, 3, 3)
y = c(1, 2, 4, 0, NA, 2, 0, NA)
dfxy = data.frame(IDS, year, x, y)

dfxy
   IDS year   x  y
1   1    1    5  1
2   1    1    5  2
3   1    1    5  4
4   1    1   10  0
5   1    1    2 NA
6   1    2   NA  2
7   2    1    3  0
8   2    1    3 NA

I want a count of the number of occurences in the two columns x and y by each IDS and each year. The count in x needs to be by unique value of x. I want an output like this:

   IDS year x y
1   1    1  3 4
2   1    2  0 1
3   2    1  1 1

It is similar to the answer with cbind in

Aggregate / summarize multiple variables per group (i.e. sum, mean, etc)

which for me would look like

aggregate(cbind(x, y)~IDS+year, data=dfxy, ???)

NA counts as no occurence, any number counts as an occurence in y, in x each unique occurence must be counted (as long as it is not NA). There are no rows with NA in both x and y. I have tried using length instead of sum, but this only seem to summarize the number of rows equally for both x and y.

Ideas or a link I can find an answer to this in? Thanks

Community
  • 1
  • 1
Lisarv
  • 87
  • 1
  • 9

2 Answers2

0

We can try with dplyr

library(dplyr)
dfxy %>% 
     group_by(IDS, year) %>%
     summarise_each(funs(sum(!is.na(.))))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    Thanks @akrun, worked perfectly! But see my updated question, because I forgot something (in a minute) – Lisarv Dec 04 '16 at 20:06
  • So for the previous question, where the varible x did NOT need to be counted by unique values, but instead just in a similar way as y, your answer was perfect @akrun. My mistake that I discovered the difference in my data set late. – Lisarv Dec 04 '16 at 20:17
  • 1
    Sorry @akrun, I asked someone else who told me to do so, but I see your point and will remember it for next time if it happens! – Lisarv Dec 05 '16 at 12:52
0

In aggregate, you need to specify the na.action parameter, as with the formula interface it defaults to na.omit, which will exclude most of your data:

aggregate(cbind(x, y) ~ IDS + year, dfxy, 
          FUN = function(x){sum(!is.na(x))}, na.action = na.pass)
##   IDS year x y
## 1   1    1 3 3
## 2   2    1 1 1
## 3   1    2 0 1

For the new question, add unique:

aggregate(cbind(x, y) ~ IDS + year, df, 
          FUN = function(x){sum(!is.na(unique(x)))}, na.action = na.pass)
##   IDS year x y
## 1   1    1 3 4
## 2   2    1 1 1
## 3   1    2 0 1

or

aggregate(cbind(x, y) ~ IDS + year, df, 
          FUN = function(x){length(unique(na.omit(x)))}, na.action = na.pass)
##   IDS year x y
## 1   1    1 3 4
## 2   2    1 1 1
## 3   1    2 0 1
alistaire
  • 42,459
  • 4
  • 77
  • 117
  • Question changed slightly, see above, did not see your answer in time @alistaire, sorry for that. – Lisarv Dec 04 '16 at 20:18