Using some sample data:
dat <- read.table(stringsAsFactors=FALSE, header=TRUE, text="
G1S1 G1S2 G1S3 G2S1 G2S2 G2S3 G3S1 G3S2 G3S3
111 121 131 211 221 231 311 321 331
112 122 132 212 222 232 312 322 332
113 123 133 213 223 233 313 323 333")
Wanting to split solely on the second character of each name, let's first know how to break that out from the rest:
gsub("S.*", "", names(dat))
# [1] "G1" "G1" "G1" "G2" "G2" "G2" "G3" "G3" "G3"
Now what we want to do is extract the columns into these three groups. One way is to split
them up. If we use split
by itself, it doesn't work:
split(dat, gsub("S.*", "", names(dat)))
# Warning in split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) :
# data length is not a multiple of split variable
# $G1
# G1S1 G1S2 G1S3 G2S1 G2S2 G2S3 G3S1 G3S2 G3S3
# 1 111 121 131 211 221 231 311 321 331
# 2 112 122 132 212 222 232 312 322 332
# 3 113 123 133 213 223 233 313 323 333
# $G2
# [1] G1S1 G1S2 G1S3 G2S1 G2S2 G2S3 G3S1 G3S2 G3S3
# <0 rows> (or 0-length row.names)
# $G3
# [1] G1S1 G1S2 G1S3 G2S1 G2S2 G2S3 G3S1 G3S2 G3S3
# <0 rows> (or 0-length row.names)
This is because split
sees that it is working on a data.frame
and tries to do things by-row, which is not what we want. If you search around, there are a number of split
S3 methods (functions that have specific versions based on the first argument) available:
methods("split")
# [1] split.data.frame split.Date split.default split.POSIXct
# see '?methods' for accessing help and source code
The version we were silently using is split.data.frame
, because dat
is a frame. We can override that:
lodf <- split.default(dat, gsub("S.*", "", names(dat)))
lodf
# $G1
# G1S1 G1S2 G1S3
# 1 111 121 131
# 2 112 122 132
# 3 113 123 133
# $G2
# G2S1 G2S2 G2S3
# 1 211 221 231
# 2 212 222 232
# 3 213 223 233
# $G3
# G3S1 G3S2 G3S3
# 1 311 321 331
# 2 312 322 332
# 3 313 323 333
From here: I personally recommend you keep them in this list
-of-frames (ergo my temp name lodf
) structure. This is made on the assumption that something you do to one you will also do (the same thing) to the others, in which case lapply
is a natural choice for operations.
If you really need to break this down (again, discouraged), you can assign them as the group names to the calling environment with:
for (nm in names(lodf)) assign(nm, lodf[[nm]])
ls()
# [1] "dat" "G1" "G2" "G3" "lodf"
A related answer dealing with frames-within-lists is How do I make a list of data frames?
You might go one step further, perhaps to deal with this in a "tidy" (i.e., "long") format: incorporate the group into the frames themselves:
do.call(rbind.data.frame, c(mapply(function(nm, x) {
# remove the 'G#' from each column name, allows row-binding later
names(x) <- gsub(paste0("^", nm), "", names(x))
# add the group name as a new column
transform(x, Grp = nm)
}, names(lodf), lodf, SIMPLIFY = FALSE), list(stringsAsFactors = FALSE)))
# S1 S2 S3 Grp
# G1.1 111 121 131 G1
# G1.2 112 122 132 G1
# G1.3 113 123 133 G1
# G2.1 211 221 231 G2
# G2.2 212 222 232 G2
# G2.3 213 223 233 G2
# G3.1 311 321 331 G3
# G3.2 312 322 332 G3
# G3.3 313 323 333 G3
(I don't like the row names G1.1
etc, but they're harmless.) Explanation:
mapply
is similar to lapply
except it takes 1 or more arguments per function-call, effectively "zipping" the list arguments together. The arguments to it are names(lodf)
and lodf
, so the first call (calling anonfunc
the anonymous function taking (nm, x)
as its arguments) looks like anonfunc(names(lodf)[[1]], lodf[[1]])
, the second anonfunc(names(lodf)[[2]], lodf[[2]])
, etc.
- within that anonymous function, I first remove the
G#
from the column names. This allows us to row-bind them later as S1
, S2
, etc.
- I add the group name to the frames as
Grp
.
This format allows you to do one thing once, applying grouping by the Grp
variable as needed. If you're using dplyr
or data.table
, for instance, it's quite easy to do ... %>% dplyr::group_by(Grp) %>% ...
or DT[, .(...), by="Grp"]
, respectively.