Let me give you a tip here: Don't just assign everything in the global environment but use lists for this. That way you avoid all the things that can go wrong when meddling with the global environment. The code you have in your question, will overwrite the original dataset data
, so you'll be in trouble if you want to rerun that code when something went wrong. You'll have to reconstruct the original dataframe.
Second: If you need to split a data frame based on a factor and carry out some code on each part, you should take a look at split
, by
and tapply
, or at the plyr
and dplyr
packages.
Using Base R
With base R, it depends on what you want to do. In the most general case you can use a combination of split()
and lapply
or even a for
loop:
mylist <- split( data, f = data$cluster)
for(mydata in mylist){
head(mydata)
...
}
Or
mylist <- split( data, f = data$cluster)
result <- lapply(mylist, function(mydata){
doSomething(mydata)
})
Which one you use, depends largely on what the result should be. If you need some kind of a summary for every subset, using lapply
will give you a list with the results per subset. If you need this for a simulation or plotting or so, you better use the for
loop.
If you want to add some variables based on other variables, then the plyr
or dplyr
packages come in handy
Using plyr and dplyr
These packages come especially handy if the result of your code is going to be an array or data frame of some kind. This would be similar to using split
and lapply
but then in a way Hadley approves of :-)
For example:
library(plyr)
result <- ddply(data, .(cluster),
function(mydata){
doSomething(mydata)
})
Use dlply
if the result should be a list.