splitting data frame by repeating strings

Question

I have a data frame in which one column repeats the same string for a varying number of rows. I'd like to split the data frame into separate data frames, one per repeating name (the output can be a list). For example, for this data frame:

dat = data.frame(names=c('dog','dog','dog','dog','cat','cat'), value=c(1,2,3,4,5,5))

The output should be

   names value
   dog     1
   dog     2
   dog     3
   dog     4

and

   names value
   cat     5
   cat     5

I should mention there are thousands of different repeating names.

If you're willing to install packages (`dplyr` or `data.table`), there are better ways of dealing with grouping variables than actually holding onto distinct data.frames. For example, in `data.table` you can use `dat[.("dog")]` to get that subset whenever you need it, and `dat[,do_stuff,by=names]` whenever you need to do the same operation on each group. (Not the downvoter.) — Frank, May 22 '15 at 16:24
What is supposed to happen with `names=c('dog','dog','dog','dog','cat','cat', 'dog','dog')`? — IRTFM, May 22 '15 at 16:28

akrun · Accepted Answer · 2015-05-22T16:54:16.517


You can use the `split` function, which returns the output as a list. I think it is easier to keep the datasets in the list, as most operations can be performed on the list itself:

 split(dat, dat$names)
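As a minimal sketch of working within the list, assuming the `dat` from the question (the names `pieces` and `sums` are only for illustration):

```r
# Example data from the question
dat <- data.frame(names = c('dog', 'dog', 'dog', 'dog', 'cat', 'cat'),
                  value = c(1, 2, 3, 4, 5, 5))

# One data frame per distinct name; list elements are named after the groups
pieces <- split(dat, dat$names)

names(pieces)      # group labels, in factor-level (alphabetical) order
pieces[["dog"]]    # the four 'dog' rows

# Apply the same operation to every piece without leaving the list
sums <- sapply(pieces, function(d) sum(d$value))
```

Each piece can be pulled out by name, and `lapply`/`sapply` cover most per-group work without creating thousands of separate objects.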

If you want to split 'dog', 'cat', 'dog' into a list with 3 elements, i.e. one per consecutive run (based on the example shown by @BondedDust), one option is

 indx <- inverse.rle(within.list(rle(as.character(dat$names)), 
                values <- seq_along(values)))
 split(dat, indx)
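To see what the run index looks like, here is a small sketch using the names from the comment above. The `dat2` data frame and its `value` column are made up for illustration, and the index is built with `rep()` rather than `inverse.rle()`, which should give the same result:

```r
# 'dog' appears in two separate runs
dat2 <- data.frame(names = c('dog', 'dog', 'dog', 'dog', 'cat', 'cat', 'dog', 'dog'),
                   value = 1:8)   # hypothetical values

# Run-length encode the names, then number each run
r <- rle(as.character(dat2$names))
indx <- rep(seq_along(r$values), r$lengths)   # 1 1 1 1 2 2 3 3

# Splitting on the run index keeps the two 'dog' runs apart
runs <- split(dat2, indx)
length(runs)   # 3
```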

Or, using the devel version of data.table, we can use `rleid` to create a grouping variable:

 library(data.table)  # v1.9.5+
 setDT(dat)[, grp:= rleid(names)]

and then use the standard data.table operations on the different groups, specifying 'grp' as the grouping variable.
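A hedged sketch of such a grouped operation, assuming a data.table version that provides `rleid`; the `dt` table, its values, and the summary columns are invented for illustration:

```r
library(data.table)

# Hypothetical data with 'dog' in two separate runs
dt <- data.table(names = c('dog', 'dog', 'dog', 'dog', 'cat', 'cat', 'dog', 'dog'),
                 value = 1:8)

# Run-length id: increments whenever 'names' changes from the previous row
dt[, grp := rleid(names)]

# One summary row per run: its label, the sum of its values, and its row count
res <- dt[, .(name = names[1], total = sum(value), n = .N), by = grp]
```

Here `res` has one row per consecutive run, so the two 'dog' runs are summarised separately.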

edited May 22 '15 at 16:54

answered May 22 '15 at 16:04

akrun


If you still want separate data frames, you can do `df = split(dat, dat$names)` and then `attach(df)` to get all the data frames separately – Veerendra Gadekar May 22 '15 at 16:10
@VeerendraGadekar I would use `list2env(df, envir=.GlobalEnv)` after naming the list elements – akrun May 22 '15 at 16:11
I think after `attach()` the data frames in the list are already available in the environment, aren't they? – Veerendra Gadekar May 22 '15 at 16:15
@VeerendraGadekar It's just my preference to use `list2env` and avoid `attach`, because if we attach a single dataset, its columns also become visible as objects on the search path (though that is not the case here for a list) – akrun May 22 '15 at 16:17
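A short sketch of the `list2env` route discussed in these comments. It assumes the `dat` from the question and, as a cautious variation, copies into a fresh environment `e` (a hypothetical name) instead of `.GlobalEnv`:

```r
dat <- data.frame(names = c('dog', 'dog', 'dog', 'dog', 'cat', 'cat'),
                  value = c(1, 2, 3, 4, 5, 5))

# split() already names the list elements after the groups
df <- split(dat, dat$names)

# Copy each list element into an environment as its own object
e <- new.env()
list2env(df, envir = e)

ls(e)    # "cat" "dog"
e$dog    # the 'dog' data frame
```

Passing `envir = .GlobalEnv` instead would create `cat` and `dog` in the workspace; note that an object named `cat` there would sit in front of base R's `cat()` function when looked up as a value.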
