3

This is my first question in this forum and I have limited experience with R, so my apologies if the question is unclear in some way or if it is too basic.

I have a dataframe called values that consists of a sample number column, two factor variables (H and W), and several number columns (named as number intervals after a cut), as follows:

sample  H   W   (12.95,13]  (13,13.05]  (13.05,13.1]    (13.1,13.15]
130 bg  d   0   0   0   0
131 bg  d   0   0   0   0
132 bg  d   0   0   0   0
133 x   i   0   0   0   0
134 x   i   0   0   0   0
135 x   i   0   0   0   0
136 x   i   0   0   0   0
137 x   i   0   0   0   0
138 x   i   0   0   0   0
139 x   i   0   0   0   0
140 x   i   0   0   0   0
141 x   i   0   35947.65    0   0
142 x   i   0   0   0   0
143 x   i   0   0   0   0
144 x   i   0   0   0   0
145 x   i   0   0   0   73709.67
146 x   i   0   0   0   0
147 x   i   21756.63    0   32362.41    0
148 x   i   0   0   0   0
149 x   i   0   0   0   0
150 x   i   0   0   0   0
151 x   i   0   0   0   0
152 x   c   0   0   0   0
153 x   c   0   0   0   0
154 x   c   0   0   0   0
155 x   c   0   0   0   32578.03

I need to count how many rows in each of the number columns for each treatment combination and sample number have values greater than 0. I have tried the aggregate, count and sum functions but have had no success so far.

Could someone help me out with this?

Thank you!

Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
EVN
  • 31
  • 2
  • 1
    Can you show us your desired output too? – David Arenburg Jun 16 '15 at 18:58
  • 1
    just to clarify, you want to "group by" the variables H and W, right? And you want to count only the nonzero values of the next four columns, by group? (What are the real column names, btw?) – C8H10N4O2 Jun 16 '15 at 19:09
  • I can try to explain it better. The output should be the sum of the number of rows in each numeric column (which are readings from an experiment) greater than 0 for each combination of H and W. Thus, I do want to group by H and W and count the nonzero values of these next four columns. The column names are just the numbers, which come directly from a mass spectrometer software. – EVN Jun 16 '15 at 19:50

4 Answers4

2

With data table (and assuming df is your data frame):

library(data.table)
setDT(df)[`colname`>0, .N, by=list(H, W, sample)]

or

setDT(df)[`colname`>0, .N, by=list(H, W)]

if you don't care about sample.

Where you should replace colname with the name of the particular column you're looking at. This would be easier for me to check if you provided a reproducible example.

Community
  • 1
  • 1
Serban Tanasa
  • 3,592
  • 2
  • 23
  • 45
0
#replicable example
set.seed(123)
values <- data.frame(sample=1:100,
                     a=rep(1,100),
                     b=rep(c(1,2),50),
                     v1=rbinom(100,1,.1) * runif(100),
                     v2=rbinom(100,1,.1) * runif(100),
                     v3=rbinom(100,1,.1) * runif(100)
                     )

aggregate(cbind(v1, v2, v3) ~ a + b, # apply fcn to LHS grouped by RHS
          data=values,              
          FUN=function(x) sum(x>0)  # sum of TRUE v>0's is count of v's greater than 0 
          )
#   a b v1 v2 v3
# 1 1 1  4  4  7
# 2 1 2  3  6  2
C8H10N4O2
  • 18,312
  • 8
  • 98
  • 134
  • How does that check for certain columns having values > 0 ? – Serban Tanasa Jun 16 '15 at 19:02
  • After re-reading the question, I believe you're right, though it's still not clear to me what the OP wants. I'll re-answer it. I agree that data.table is very useful here, but it can be frustrating to novices, and learning how to do it in base is a good exercise. – C8H10N4O2 Jun 16 '15 at 19:06
0

Maybe I didn't get it (my solution seems to be very simple), but I just apply on rows the sum of where columns are all different from 0. The output is a numeric vector with length equal to the number of row of you data where:

  • 0 means there is no one column with a value other than 0
  • 1 means there is at least one column with a value greater than 0 and so on...

     apply(!df[, 4:7] == 0, 1, sum)
    
    [1] 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 2 0 0 0 0 0 0 0 1
    
SabDeM
  • 7,050
  • 2
  • 25
  • 38
0

An imperfect solution using plyr (I am pretty sure dplyr package can do even better job, but I am less familiar with it)

The downside is that sums have to be calculated for each column separately - if there are 3 or 4 of them, it is OK, but for 100 intervals that wouldn't be viable.

##Generate fake data with 3 samples, 2 factors 3 levels each 
##and 3 observations per combination
df <- expand.grid(sample = letters[1:3], 
                  f1 = paste0('x', 1:3), 
                  f2 = paste0('y', 1:3))
df <- rbind(df, df, df)
nums <- matrix(rnorm(4*nrow(df)), ncol = 4)
colnames(nums) <- paste0('val_', 1:4)
nums[nums < 1] <- 0
df <- cbind(df, nums)

##Summarize
require(plyr)
ddply(df, .(sample, f1, f2), summarize, 
           sum_1 = sum(val_1 > 0),
           sum_2 = sum(val_2 > 0))
donshikin
  • 1,423
  • 1
  • 8
  • 6