
I am trying to reshape/reduce my data. So far I use a for loop (very slow), but from what I can tell this should be quite fast with plyr.

I have many groups (firms, as a factor in the dataset) and I want to drop every firm entirely that shows a 0 entry for value in any of its cells. I thus create a new data.frame that leaves out all groups showing 0 for value at some point.

The for loop:

Data Creation:

set.seed(1) 
mydf <- data.frame(firmname = sample(LETTERS[1:5], 40, replace = TRUE), 
        value = rpois(40, 2))

-----------------------------
splitby <- mydf$firmname

new.data <- data.frame()

for (i in seq_along(unique(splitby))) {
    temp <- subset(mydf, splitby == unique(splitby)[i])
    if (all(temp$value > 0)) {
        new.data <- rbind(new.data, temp)
    }
}

Delete all empty firm factor levels:

new.data$firmname <- factor(new.data$firmname)

Is there a way to achieve this with the plyr package? Can the subset function be used in that context?

EDIT: To make the problem reproducible, the data-creation code suggested by BenBarnes has been added. Ben, thanks a lot for that. Furthermore, my code has been altered to comply with the answers provided below.

Paul Hiemstra
Jan
  • You don't provide sample data, but this sounds like a standard subset using the `[` operator. – Andrie Apr 27 '12 at 11:27
  • @Andrie it sounds to me like he wants to drop all entries in a group in which any entry meets some condition. So `plyr` or `by` seem easier. Jan, please read this as it will help us solve your question: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Ari B. Friedman Apr 27 '12 at 12:03
  • Thanks a lot for your answers! I edited the post and added data reproduction as suggested by Ben (see below). – Jan Apr 30 '12 at 13:20
  • @Andrie: thanks a lot! The standard subset is actually what I need. Computationally, my for loop needs 122 seconds with my small test data set (~55k observations), plyr 17 seconds, and the `[` operation only 0.07! – Jan Apr 30 '12 at 13:23

1 Answer


You could supply an anonymous function to the `.fun` argument of `ddply()`:

set.seed(1)

mydf <- data.frame(firmname = sample(LETTERS[1:5], 40, replace = TRUE),
  value = rpois(40, 2))

library(plyr)

ddply(mydf, .(firmname), function(x) if (any(x$value == 0)) NULL else x)
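Since the question also asks whether `subset()` can be used in that context: it can, because `subset()` recycles a length-one logical condition across every row of the piece, so each firm is kept or dropped as a whole. A sketch (an addition, not part of the original answer), reusing the data created above:

```r
library(plyr)

set.seed(1)
mydf <- data.frame(firmname = sample(LETTERS[1:5], 40, replace = TRUE),
                   value = rpois(40, 2))

# all(value > 0) evaluates to a single TRUE/FALSE per firm, which
# subset() recycles over the whole piece: each firm is returned
# either in full or as a zero-row data frame
zerofree <- ddply(mydf, .(firmname), subset, all(value > 0))
```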

Or using [, as suggested by Andrie:

firms0 <- unique(mydf$firmname[which(mydf$value == 0)])

mydf[-which(mydf$firmname %in% firms0), ]
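One caveat worth noting about the `-which()` form (an addition, not part of the original answer): when no firm contains a zero, `which()` returns `integer(0)`, and indexing with `-integer(0)` selects zero rows rather than all of them. Negating with `!` avoids this edge case:

```r
set.seed(1)
mydf <- data.frame(firmname = sample(LETTERS[1:5], 40, replace = TRUE),
                   value = rpois(40, 2))

firms0 <- unique(mydf$firmname[mydf$value == 0])

# Logical negation keeps every row when firms0 is empty, whereas
# mydf[-which(...), ] would silently return a zero-row data frame
kept <- mydf[!(mydf$firmname %in% firms0), ]
```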

Note that the results of `ddply()` are sorted according to `firmname`.
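If the original row order matters, it can be restored afterwards by carrying the row indices through. A sketch (an addition, not part of the original answer):

```r
library(plyr)

set.seed(1)
mydf <- data.frame(firmname = sample(LETTERS[1:5], 40, replace = TRUE),
                   value = rpois(40, 2))

mydf$row <- seq_len(nrow(mydf))  # remember the original positions

res <- ddply(mydf, .(firmname),
             function(x) if (any(x$value == 0)) NULL else x)

res <- res[order(res$row), ]     # back to the original row order
```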

EDIT

For the example in your comments (selecting only firms with more than three entries), this approach is again faster than using `ddply()` to subset:

# Count observations per firm
firmTable <- table(mydf$firmname)

# Keep only the firms with more than three entries
firmsGT3 <- names(firmTable)[firmTable > 3]

mydf[mydf$firmname %in% firmsGT3, ]
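Both conditions discussed in this thread (no zero entries and more than three observations per firm) can also be combined in a single pass with base R's `ave()`. This is a sketch of an alternative, not something from the original answer:

```r
set.seed(1)
mydf <- data.frame(firmname = sample(LETTERS[1:5], 40, replace = TRUE),
                   value = rpois(40, 2))

# ave() applies the function per firm and repeats its single
# TRUE/FALSE result for every row of that firm
keep <- as.logical(ave(mydf$value, mydf$firmname,
                       FUN = function(v) all(v > 0) && length(v) > 3))

result <- mydf[keep, ]
```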
BenBarnes
  • Hey Ben, thanks for these great answers! It is exactly what I was looking for. I was not aware of how to apply the 2nd version (the [ operation) on entire groups. As written in a comment further above, the [ operation is much faster than the Plyr or the for loop. – Jan Apr 30 '12 at 13:25
  • The 2nd answer works nicely when conditioning on single cell values. Would it also work when conditioning on the number of rows? I tried plyr: `mydf <- ddply(mydf,.(firmname), function(x) if(length(x$firmname < 3 )) NULL else x )` and many other variations of the `[` approach but can't get it to work. – Jan Apr 30 '12 at 17:13
  • I got it to work! `ddply(mydf, .(firmname), function(x) if(length(x$firmname) > 3) NULL else x )` is doing the job. This selects all groups (firms) with more than 3 observations. With `<` this data is sorted out. I guess it was only one `)` too much. – Jan Apr 30 '12 at 17:43
  • @Jan, thanks for the interesting additional question, and nice job solving it! I've edited my answer to include an alternative solution that seems to be faster than your solution using `ddply()`. – BenBarnes May 02 '12 at 08:26