86

I want to split a data frame into several smaller ones. This looks like a trivial question, but I could not find a solution by searching the web.

josliber
Leo5188
  • never understood `split()`, but using [`ntile` from `dplyr`](http://stackoverflow.com/a/27646599/1888983) and then filtering by the group index ("quartile") did what I wanted: `group = df[df$quartile==i,]`. – jozxyqk Feb 17 '15 at 08:14

8 Answers

71

You may also want to cut the data frame into an arbitrary number of smaller data frames by assigning rows to groups at random. Here, we cut it into two data frames of 13 rows each.

x = data.frame(num = 1:26, let = letters, LET = LETTERS)
set.seed(10)
split(x, sample(rep(1:2, 13)))

gives

$`1`
   num let LET
3    3   c   C
6    6   f   F
10  10   j   J
12  12   l   L
14  14   n   N
15  15   o   O
17  17   q   Q
18  18   r   R
20  20   t   T
21  21   u   U
22  22   v   V
23  23   w   W
26  26   z   Z

$`2`
   num let LET
1    1   a   A
2    2   b   B
4    4   d   D
5    5   e   E
7    7   g   G
8    8   h   H
9    9   i   I
11  11   k   K
13  13   m   M
16  16   p   P
19  19   s   S
24  24   x   X
25  25   y   Y
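
For readers puzzled by the `sample()` call, the grouping vector can be built up step by step. This is only a sketch of the same idea; the names `k`, `labels`, `groups`, and `pieces` are introduced here for illustration and are not part of the original answer.

```r
x <- data.frame(num = 1:26, let = letters, LET = LETTERS)
k <- 2                                   # hypothetical: number of pieces wanted

# One label per row, cycling 1, 2, ..., k so group sizes differ by at most one
labels <- rep_len(seq_len(k), nrow(x))

# sample() with no size argument returns a random permutation of its input,
# so each row gets a random label while the group sizes stay fixed
set.seed(10)
groups <- sample(labels)

pieces <- split(x, groups)               # named list of k data frames
sapply(pieces, nrow)                     # rows in each piece
```

Setting `k` to 3 instead would give pieces of 9, 9, and 8 rows, since 26 is not divisible by 3.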

You can also split a data frame based upon an existing column. For example, to create three data frames based on the cyl column in mtcars:

split(mtcars, mtcars$cyl)
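
The result of `split()` is a named list, so the individual data frames can be pulled out by name. A minimal sketch using the same `mtcars` example; the variable names `by_cyl` and `four_cyl` are chosen here for illustration.

```r
by_cyl <- split(mtcars, mtcars$cyl)

names(by_cyl)               # one element per distinct value of cyl
sapply(by_cyl, nrow)        # number of rows in each piece

four_cyl <- by_cyl[["4"]]   # names are character strings, so quote them
head(four_cyl)
```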
joran
Greg
  • Hey Greg, I couldn't understand the syntax of the sample command. Can you explain it? – Anirudh Feb 01 '15 at 09:45
  • "You may also want to cut the data frame into an arbitrary number of smaller dataframes. Here, we cut into two dataframes." How is this an arbitrary number of data frames if you are specifying two dataframes here? – user5359531 Mar 16 '16 at 17:09
  • @user5359531, arbitrary two data frames here. – Demo Dec 10 '16 at 21:28
19

If you want to split a dataframe according to values of some variable, I'd suggest using daply() from the plyr package.

library(plyr)
x <- daply(df, .(splitting_variable), function(x)return(x))

Now, x is an array of dataframes. To access one of the dataframes, you can index it with the name of the level of the splitting variable.

x$Level1
#or
x[["Level1"]]

Before splitting the data into many data frames, though, I'd first make sure there isn't a cleverer way to deal with it.

JoFrhwld
  • please state upfront the package from which a non-base function is from - presumably you mean daply from package plyr? – mdsumner Jul 21 '10 at 20:12
  • I loaded plyr in my code snippet, so I thought it was clear, but I'll edit the answer prose for clarity. – JoFrhwld Jul 21 '10 at 20:18
  • I suggested `dlply` first, but it didn't automatically name the entries by the grouping variable. I don't know what I did first, but apparently `daply` doesn't work unless a function is specified. I edited the answer to work. – JoFrhwld Jul 21 '10 at 21:03
16

You could also use

data2 <- data[data$sum_points == 2500, ]

This creates a data frame containing only the rows where sum_points equals 2500.

For example, here is an excerpt of the original data, followed by the resulting data2:

airfoils sum_points field_points   init_t contour_t   field_t
...
491        5       2500         5625 0.000086  0.004272  6.321774
498        5       2500         5625 0.000087  0.004507  6.325083
504        5       2500         5625 0.000088  0.004370  6.336034
603        5        250        10000 0.000072  0.000525  1.111278
577        5        250        10000 0.000104  0.000559  1.111431
587        5        250        10000 0.000072  0.000528  1.111524
606        5        250        10000 0.000079  0.000538  1.111685
....
> data2 <- data[data$sum_points == 2500, ]
> data2
airfoils sum_points field_points   init_t contour_t   field_t
108        5       2500          625 0.000082  0.004329  0.733109
106        5       2500          625 0.000102  0.004564  0.733243
117        5       2500          625 0.000087  0.004321  0.733274
112        5       2500          625 0.000081  0.004428  0.733587
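
To split on every distinct value of the column at once rather than on one fixed value, `split()` can be applied to the same column. The sketch below uses a small made-up data frame as a stand-in for the timing data above; its values are hypothetical and only the column names are borrowed from the answer.

```r
# Hypothetical stand-in for the timing data shown above
data <- data.frame(
  sum_points   = c(2500, 2500, 250, 250, 250),
  field_points = c(625, 5625, 10000, 10000, 625),
  field_t      = c(0.73, 6.32, 1.11, 1.11, 0.12)
)

# One fixed value: logical row index, all columns kept
data2 <- data[data$sum_points == 2500, ]

# Every distinct value at once: a named list with one data frame per value
by_points <- split(data, data$sum_points)
names(by_points)
```

Here `by_points[["2500"]]` holds the same rows as `data2`.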
Ronak Shah
  • hi, how would you go about if you wanted to split it dynamically into a different data_frame based on unique values in that column.? – kRazzy R Apr 06 '17 at 02:54
13

I just posted a kind of RFC that might help you: "Split a vector into chunks in R".

x = data.frame(num = 1:26, let = letters, LET = LETTERS)
## number of chunks
n <- 2
dfchunk <- split(x, factor(sort(rank(row.names(x))%%n)))
dfchunk
$`0`
   num let LET
1    1   a   A
2    2   b   B
3    3   c   C
4    4   d   D
5    5   e   E
6    6   f   F
7    7   g   G
8    8   h   H
9    9   i   I
10  10   j   J
11  11   k   K
12  12   l   L
13  13   m   M

$`1`
   num let LET
14  14   n   N
15  15   o   O
16  16   p   P
17  17   q   Q
18  18   r   R
19  19   s   S
20  20   t   T
21  21   u   U
22  22   v   V
23  23   w   W
24  24   x   X
25  25   y   Y
26  26   z   Z
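
As an alternative sketch, not taken from the linked post, contiguous chunks can also be produced from the row positions directly with `cut()` or `ceiling()`. The values of `n` and `chunk_size` below are assumptions chosen for illustration.

```r
x <- data.frame(num = 1:26, let = letters, LET = LETTERS)
n <- 3                                    # number of chunks (assumed value)

# cut() divides the row positions 1..nrow(x) into n equal-width intervals;
# labels = FALSE returns the interval number for each row
chunk_id <- cut(seq_len(nrow(x)), breaks = n, labels = FALSE)
dfchunk  <- split(x, chunk_id)
sapply(dfchunk, nrow)

# Alternative when a chunk size, not a chunk count, is fixed
chunk_size <- 10                          # assumed value
by_size <- split(x, ceiling(seq_len(nrow(x)) / chunk_size))
sapply(by_size, nrow)
```

With a fixed `chunk_size`, the last chunk simply holds whatever rows remain.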

Cheers, Sebastian

Community
Sebastian
8

The answer you want depends very much on how and why you want to break up the data frame.

For example, if you want to leave out some variables, you can create a new data frame from specific columns of the original. The subscripts in brackets after the data frame name refer to rows and columns, in that order. Check out S Poetry for a complete description.

newdf <- mydf[,1:3]

Or, you can choose specific rows.

newdf <- mydf[1:3,]

These subscripts can also be logical tests, such as choosing the rows that contain a particular value or whose factor has a desired level.
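
A brief sketch of logical subscripts, using a small made-up data frame; `mydf` and its columns `id`, `group`, and `score` are hypothetical.

```r
# Hypothetical example data
mydf <- data.frame(
  id    = 1:6,
  group = factor(c("a", "b", "a", "c", "b", "a")),
  score = c(10, 25, 31, 8, 47, 19)
)

high   <- mydf[mydf$score > 20, ]       # rows passing a numeric test
only_a <- mydf[mydf$group == "a", ]     # rows with one factor level

# Tests can be combined, and columns chosen by name at the same time
both <- mydf[mydf$group == "a" & mydf$score > 15, c("id", "score")]
```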

What do you want to do with the chunks left over? Do you need to perform the same operation on each chunk of the data frame? If so, you'll want the subsets to end up in a convenient object, such as a list, so that you can run the same command on each chunk.
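
One way this might look in practice, sketched with the built-in `iris` data: the chunks live in a list, `lapply()` runs the same function on each, and `do.call(rbind, ...)` stacks the results. The summary computed here is an arbitrary choice for illustration.

```r
chunks <- split(iris, iris$Species)          # list of three data frames

# Run the same command on every chunk; each call returns a one-row data frame
per_chunk <- lapply(chunks, function(d) {
  data.frame(n = nrow(d), mean_sepal_length = mean(d$Sepal.Length))
})

# Stack the per-chunk results back into a single data frame
result <- do.call(rbind, per_chunk)
result
```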

Ben M
7

subset() is also useful:

subset(DATAFRAME, COLUMNNAME == "")
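
A concrete sketch of the template above, using `mtcars` in place of the placeholders. Column names can be written unquoted inside `subset()`, and the optional `select` argument picks columns.

```r
four_cyl <- subset(mtcars, cyl == 4)            # rows where cyl is 4

# Combine conditions and keep only some columns
efficient <- subset(mtcars, cyl == 4 & mpg > 30,
                    select = c(mpg, cyl, hp))
```

Note that `subset()` drops rows for which the condition evaluates to NA, which differs from plain `[` indexing with a logical vector.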

If you are working with survey data, maybe the survey package is pertinent?

http://faculty.washington.edu/tlumley/survey/

DJV
apeescape
3

If you want to split by values in one of the columns, you can use lapply. For instance, to split ChickWeight into a separate dataset for each chick:

data(ChickWeight)
lapply(unique(ChickWeight$Chick), function(x) ChickWeight[ChickWeight$Chick == x,])
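
The list returned by the `lapply()` call above is unnamed. A small variation, sketched here with the illustrative names `chick_ids` and `by_chick`, attaches the chick identifiers as names so that each dataset can be looked up directly.

```r
data(ChickWeight)
chick_ids <- as.character(unique(ChickWeight$Chick))

by_chick <- lapply(chick_ids, function(id) ChickWeight[ChickWeight$Chick == id, ])
names(by_chick) <- chick_ids        # lapply() alone returns an unnamed list here

nrow(by_chick[["1"]])               # look a chick up by its identifier
```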
mikeck
3

Splitting the data frame seems counter-productive. Instead, use the split-apply-combine paradigm. For example, generate some data:

df = data.frame(grp=sample(letters, 100, TRUE), x=rnorm(100))

Then split only the relevant column, apply the scale() function to x within each group, and combine the results (using split<- or ave):

df$z = 0
split(df$z, df$grp) = lapply(split(df$x, df$grp), scale)
## alternative: df$z = ave(df$x, df$grp, FUN=scale)

This will be very fast compared to splitting data.frames, and the result remains usable in downstream analysis without iteration. I think the dplyr syntax is

library(dplyr)
df %>% group_by(grp) %>% mutate(z=scale(x))

In general this dplyr solution is faster than splitting data frames but not as fast as the base R split<- or ave approach above.
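
The two base R routes mentioned above (`split<-` and `ave`) can be checked against each other. This is a sketch under assumed data: the groups are made equal in size here, unlike the random groups in the answer, so that every group has a well-defined standard deviation. The column name `z_ave` is introduced for the comparison.

```r
set.seed(1)
df <- data.frame(grp = rep(letters[1:5], each = 20), x = rnorm(100))

# split<- route: scale x within each group and write the pieces back in place
df$z <- 0
split(df$z, df$grp) <- lapply(split(df$x, df$grp), scale)

# ave() route: the same computation in one call
df$z_ave <- ave(df$x, df$grp, FUN = scale)

all.equal(df$z, df$z_ave)     # the two routes should agree
tapply(df$z, df$grp, mean)    # approximately 0 in every group
tapply(df$z, df$grp, sd)      # approximately 1 in every group
```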

Martin Morgan