I want to split a data frame into several smaller ones. This looks like a trivial question, but I cannot find a solution through web search.
-
I never understood `split()`, but using [`ntile` from `dplyr`](http://stackoverflow.com/a/27646599/1888983) and then filtering by the group index ("quartile") did what I wanted: `group = df[df$quartile==i,]`. – jozxyqk Feb 17 '15 at 08:14
8 Answers
You may also want to cut the data frame into an arbitrary number of smaller dataframes. Here, we cut it at random into two dataframes of 13 rows each.
x = data.frame(num = 1:26, let = letters, LET = LETTERS)
set.seed(10)
split(x, sample(rep(1:2, 13)))
gives
$`1`
num let LET
3 3 c C
6 6 f F
10 10 j J
12 12 l L
14 14 n N
15 15 o O
17 17 q Q
18 18 r R
20 20 t T
21 21 u U
22 22 v V
23 23 w W
26 26 z Z
$`2`
num let LET
1 1 a A
2 2 b B
4 4 d D
5 5 e E
7 7 g G
8 8 h H
9 9 i I
11 11 k K
13 13 m M
16 16 p P
19 19 s S
24 24 x X
25 25 y Y
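To unpack the `sample(rep(1:2, 13))` call above: `rep(1:2, 13)` builds a vector of 26 group labels (thirteen 1s and thirteen 2s), and `sample()` shuffles it so that rows are assigned to groups at random. A minimal sketch of the same idea for a different number of groups follows; the names `k`, `labels`, and `groups` are introduced here for illustration and are not part of the original answer.

```r
x <- data.frame(num = 1:26, let = letters, LET = LETTERS)
k <- 4  # hypothetical number of groups

# One label per row, recycled over 1..k so group sizes differ by at most one
labels <- rep(seq_len(k), length.out = nrow(x))

set.seed(10)
# sample() on a vector of length > 1 returns a random permutation of it
groups <- split(x, sample(labels))

length(groups)        # number of data frames in the list
sapply(groups, nrow)  # rows in each group
```

Changing `k` changes how many data frames come back, which is what makes the number of pieces arbitrary.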
You can also split a data frame based upon an existing column. For example, to create three data frames based on the `cyl` column in `mtcars`:
split(mtcars, mtcars$cyl)
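As a short illustration of what that call returns, the sketch below stores the result and pulls out one piece. The variable names `by_cyl` and `four_cyl` are chosen here for the example only.

```r
by_cyl <- split(mtcars, mtcars$cyl)

names(by_cyl)         # one list element per distinct value of cyl
sapply(by_cyl, nrow)  # number of cars in each group

# Pull out a single data frame by the name of its group
four_cyl <- by_cyl[["4"]]
head(four_cyl)
```

The result is an ordinary named list, so each element is a regular data frame that can be used on its own.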
-
Hey greg, I couldn't understand the syntax for the sample command; can you explain it? – Anirudh Feb 01 '15 at 09:45
-
"You may also want to cut the data frame into an arbitrary number of smaller dataframes. Here, we cut into two dataframes." How is this an arbitrary number of data frames if you are specifying two dataframes here? – user5359531 Mar 16 '16 at 17:09
If you want to split a dataframe according to values of some variable, I'd suggest using `daply()` from the `plyr` package.
library(plyr)
x <- daply(df, .(splitting_variable), function(x)return(x))
Now, `x` is an array of dataframes. To access one of the dataframes, you can index it with the name of the level of the splitting variable.
x$Level1
#or
x[["Level1"]]
I'd make sure there aren't other, more clever ways to deal with your data before splitting it up into many dataframes, though.

-
Please state upfront which package a non-base function comes from - presumably you mean daply from package plyr? – mdsumner Jul 21 '10 at 20:12
-
I loaded plyr in my code snippet, so I thought it was clear, but I'll edit the answer prose for clarity. – JoFrhwld Jul 21 '10 at 20:18
-
I suggested `dlply` first, but it didn't automatically name the entries by the grouping variable. I don't know what I did first, but apparently `daply` doesn't work unless a function is specified. I edited the answer to work. – JoFrhwld Jul 21 '10 at 21:03
You could also use
data2 <- data[data$sum_points == 2500, ]
This creates a data frame containing only the rows where sum_points equals 2500.
For example, starting from this data:
airfoils sum_points field_points init_t contour_t field_t
...
491 5 2500 5625 0.000086 0.004272 6.321774
498 5 2500 5625 0.000087 0.004507 6.325083
504 5 2500 5625 0.000088 0.004370 6.336034
603 5 250 10000 0.000072 0.000525 1.111278
577 5 250 10000 0.000104 0.000559 1.111431
587 5 250 10000 0.000072 0.000528 1.111524
606 5 250 10000 0.000079 0.000538 1.111685
....
> data2 <- data[data$sum_points == 2500, ]
> data2
airfoils sum_points field_points init_t contour_t field_t
108 5 2500 625 0.000082 0.004329 0.733109
106 5 2500 625 0.000102 0.004564 0.733243
117 5 2500 625 0.000087 0.004321 0.733274
112 5 2500 625 0.000081 0.004428 0.733587

-
Hi, how would you go about it if you wanted to split it dynamically into different data frames based on the unique values in that column? – kRazzy R Apr 06 '17 at 02:54
I just posted a kind of RFC that might help you: Split a vector into chunks in R
x = data.frame(num = 1:26, let = letters, LET = LETTERS)
## number of chunks
n <- 2
dfchunk <- split(x, factor(sort(rank(row.names(x))%%n)))
dfchunk
$`0`
num let LET
1 1 a A
2 2 b B
3 3 c C
4 4 d D
5 5 e E
6 6 f F
7 7 g G
8 8 h H
9 9 i I
10 10 j J
11 11 k K
12 12 l L
13 13 m M
$`1`
num let LET
14 14 n N
15 15 o O
16 16 p P
17 17 q Q
18 18 r R
19 19 s S
20 20 t T
21 21 u U
22 22 v V
23 23 w W
24 24 x X
25 25 y Y
26 26 z Z
Cheers, Sebastian
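For comparison, here is a sketch of a different way to get contiguous chunks that works on row positions rather than on ranked row names. This is a swapped-in technique, not the method from the answer above, and `chunk_id` is a name introduced for the example.

```r
x <- data.frame(num = 1:26, let = letters, LET = LETTERS)
n <- 3  # number of chunks

# Row i goes to chunk ceiling(i * n / nrow(x)), so chunks stay contiguous
chunk_id <- ceiling(seq_len(nrow(x)) * n / nrow(x))
dfchunk <- split(x, chunk_id)

sapply(dfchunk, nrow)  # chunk sizes differ by at most one row
```

Because the chunk index is computed from the row position, the original row order is preserved within and across chunks.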
The answer you want depends very much on how and why you want to break up the data frame.
For example, if you want to leave out some variables, you can create new data frames from specific columns of the original data frame. The subscripts in brackets after the data frame refer to row and column numbers. Check out Spoetry for a complete description.
newdf <- mydf[,1:3]
Or, you can choose specific rows.
newdf <- mydf[1:3,]
And these subscripts can also be logical tests, such as choosing rows that contain a particular value, or factors with a desired value.
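A minimal sketch of logical subscripts, using a small made-up data frame (`mydf` and its columns `id`, `group`, and `score` are invented here purely for illustration):

```r
mydf <- data.frame(
  id    = 1:6,
  group = factor(c("a", "b", "a", "c", "b", "a")),
  score = c(10, 25, 31, 8, 47, 19)
)

# Rows where a numeric column meets a condition
high <- mydf[mydf$score > 20, ]

# Rows where a factor has a desired level, keeping only two columns
group_a <- mydf[mydf$group == "a", c("id", "score")]

# Conditions can be combined with & (and) or | (or)
a_or_high <- mydf[mydf$group == "a" | mydf$score > 20, ]
```

Note the comma after the logical test: it selects rows, and leaving the column position empty keeps all columns.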
What do you want to do with the chunks left over? Do you need to perform the same operation on each chunk of the data frame? Then you'll want to ensure that the subsets of the data frame end up in a convenient object, such as a list, that will help you perform the same command on each chunk.
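As one possible illustration of keeping the chunks in a list, the sketch below uses the built-in `mtcars` data; the choice of data set, the grouping column, and the model formula are assumptions made for the example.

```r
# Split into a list of data frames, one per number of cylinders
chunks <- split(mtcars, mtcars$cyl)

# Apply the same operation to every chunk; here, the mean mpg per chunk
mean_mpg <- sapply(chunks, function(d) mean(d$mpg))
mean_mpg

# Fit the same model to every chunk and keep the results in a list
fits <- lapply(chunks, function(d) lm(mpg ~ wt, data = d))
names(fits)
```

Keeping everything in one list means `lapply()` or `sapply()` can run the same command on each chunk without separately named objects.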

`subset()` is also useful:
subset(DATAFRAME, COLUMNNAME == "")
For a survey package, maybe the `survey` package is pertinent?
If you want to split by values in one of the columns, you can use `lapply`. For instance, to split `ChickWeight` into a separate dataset for each chick:
data(ChickWeight)
lapply(unique(ChickWeight$Chick), function(x) ChickWeight[ChickWeight$Chick == x,])

Splitting the data frame seems counter-productive. Instead, use the split-apply-combine paradigm. For example, generate some data
df = data.frame(grp=sample(letters, 100, TRUE), x=rnorm(100))
then split only the relevant columns and apply the `scale()` function to x in each group, and combine the results (using `split<-` or `ave`)
df$z = 0
split(df$z, df$grp) = lapply(split(df$x, df$grp), scale)
## alternative: df$z = ave(df$x, df$grp, FUN=scale)
This will be very fast compared to splitting data.frames, and the result remains usable in downstream analysis without iteration. I think the dplyr syntax is
library(dplyr)
df %>% group_by(grp) %>% mutate(z=scale(x))
In general this dplyr solution is faster than splitting data frames but not as fast as split-apply-combine.
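To check that the two base R routes mentioned above agree, here is a small self-contained sketch. The seed, the five-letter grouping, and the column name `z_ave` are choices made for this example, not part of the answer.

```r
set.seed(1)
df <- data.frame(grp = sample(letters[1:5], 100, TRUE), x = rnorm(100))

# Route 1: split<- assigns the per-group results back into place
df$z <- 0
split(df$z, df$grp) <- lapply(split(df$x, df$grp), scale)

# Route 2: ave() does the split, apply and reassembly in one call
df$z_ave <- ave(df$x, df$grp, FUN = scale)

all.equal(df$z, df$z_ave)

# Within each group the scaled values should have mean ~0 and sd 1
tapply(df$z, df$grp, mean)
tapply(df$z, df$grp, sd)
```

In both routes the data frame is never broken apart; only the column being transformed is split and then written back in the original row order.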
