
I have a data frame that looks something like this:

x    y
1    a
1    b
1    c
1    NA
1    NA
2    d
2    e
2    NA
2    NA

My desired output is a data frame that shows, for each X, the number of complete cases of Y (that is, the non-NA values). For example, if Y has 2500 complete observations for X = 1 and 557 for X = 2, I should get this simple data frame:

x    y(c.cases)
1    2500
2    557

My function currently works for a single X, but when I pass a range of X values (for example 30:25), I get the total number of complete cases across all of them instead of a separate count for each X. This is an outline of my function:

complete <- function(){
    files <- file.list()
    dat <- c()    # Creates an empty vector
    Y <- c()      # Empty vector that will list down the Ys
    result <- c()
    for(i in c(X)){
        dat <- rbind(dat, read.csv(files[i]))
    }
    dat_subset_Y <- dat[which(dat[, 'X'] %in% x), ]
    Y <- c(Y, sum(complete.cases(dat)))
    result <- cbind(X, Y)
    print(result)
}

There are no errors or warning messages, only wrong results when X is a range.

Jaap
  • See also this: http://stackoverflow.com/questions/1660124/how-to-sum-a-variable-by-group – David Arenburg Oct 19 '15 at 08:54
  • I wonder why there are so many R questions on SO requesting solutions with FOR loops. – RHertel Oct 19 '15 at 09:58
  • @RHertel That's because looping is one of the most common methods, while packages like dplyr offer new syntax in place of these loops, something that most people are not fully aware of and that some, like me, are still getting to grips with. –  Oct 19 '15 at 10:11
  • It is perfectly alright not to be aware of something, and many people here are willing to give excellent hints. What surprises me is that, in spite of this lack of knowledge, several questions ask for a specific way to solve a problem - like using FOR loops. FOR loops are uncommon and in many cases at best unnecessary in R (by the way, this has nothing to do with dplyr, which is but one of many useful packages and not the reason why using FOR loops as a starting point is generally not a good idea in R. FOR loops can be useful in some cases, but those are rather exceptions in R). – RHertel Oct 19 '15 at 10:25
  • @RHertel If those are questions being asked by the same people over and over again, then that would come out as strange to me too. –  Oct 19 '15 at 13:09

2 Answers

3

We can use data.table. We convert the 'data.frame' to a 'data.table' with setDT(df1), group by 'x', and count the non-NA elements of 'y' in each group with sum(!is.na(y)).

library(data.table)
setDT(df1)[, list(y=sum(!is.na(y))), by = x]

Another option is table, which cross-tabulates x against !is.na(y); the TRUE column holds the non-NA counts:

with(df1, table(x, !is.na(y)))
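
The `df1` object is not defined in the answer. As a minimal sketch, assuming it holds the sample from the question, the `table` result can be reduced to the requested two-column data frame in base R (the names `tab` and `result` are illustrative only):

```r
# Assumption: df1 is the sample data frame shown in the question.
df1 <- data.frame(
  x = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  y = c("a", "b", "c", NA, NA, "d", "e", NA, NA),
  stringsAsFactors = FALSE
)

# Rows are the values of x; columns are FALSE (y is NA) and TRUE (y is present).
tab <- with(df1, table(x, !is.na(y)))

# Keep only the TRUE column, i.e. the number of non-NA y values per x.
result <- data.frame(
  x = as.numeric(rownames(tab)),
  y = as.vector(tab[, "TRUE"])
)
print(result)
```

For this sample the counts are 3 for x = 1 and 2 for x = 2. Note that indexing the "TRUE" column would fail if y contained no non-NA values at all, so this sketch assumes at least one complete observation.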
akrun
  • And `aggregate(y ~ x, df, function(z) sum(!is.na(z)))` and `with(df, tapply(y, x, FUN = function(z) sum(!is.na(z))))` and the dplyr stuff, etc. – David Arenburg Oct 19 '15 at 08:52
2

There is no need for that loop.

library(dplyr)
df %>%
  filter(complete.cases(.)) %>%
  group_by(x) %>%
  summarise(sumy=length(y))

Or

df %>% 
  group_by(x) %>% 
  summarise(sumy=sum(!is.na(y)))
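
The two pipelines agree unless some group has no complete cases at all: filtering first removes that group entirely, while `sum(!is.na(y))` keeps it with a count of 0. A minimal sketch of the difference, assuming dplyr is installed and using a made-up data frame `df` (not the asker's data) in which every y for x = 3 is NA; `by_filter` and `by_sum` are illustrative names:

```r
library(dplyr)

# Hypothetical data: group 3 has no non-NA y values.
df <- data.frame(
  x = c(1, 1, 1, 2, 2, 3),
  y = c("a", "b", NA, "d", NA, NA),
  stringsAsFactors = FALSE
)

# Variant 1: drop incomplete rows first, then count rows per group.
by_filter <- df %>%
  filter(complete.cases(.)) %>%
  group_by(x) %>%
  summarise(sumy = length(y))

# Variant 2: keep every row and count the non-NA y values per group.
by_sum <- df %>%
  group_by(x) %>%
  summarise(sumy = sum(!is.na(y)))

print(by_filter)  # group 3 is absent
print(by_sum)     # group 3 is present with a count of 0
```

If every X should appear in the output even when it has no complete observations, the second variant is probably the one to use.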
David Arenburg
Paulo E. Cardoso