-2

I have a data file with n columns divided in 3 groups and each group has several subject. So the Headers are G1S1, G1S2...G2S1,G2S2.. The data in each column is independent from the others columns. What I'm trying to do is put all the data from Group 1 in column1. Group 2 in column 2 etc. I tried the following code that works for putting all all the data from group 1 into a new dataframe with one column G1 as header.

dt <-TestFile [1:5] # extract data from group 1
dt2 <- NULL
tmp1 <- NULL

for (i in 1:ncol(dt)) {
  ColName <- names(dt)[i] #Get the column mame
  tmp1 <- dt[ColName] #copy data to tmp1
  GrpName <- substring(ColName,1,2) #get group name from column name
  names(tmp1)[names(tmp1)==ColName]<-GrpName #rename column header to match column in dt2 '
  dt2 <- rbind (dt2,tmp1) # merge data together
}

This code works for one group, however if I want to add another group I'm stuck because the rbin function does now work anymore as the number of column are not the same anymore.

If I add those iff statements in the loop I can create 3 data.frames, but If I get files with different group numbers then I will have to add more if statements and that is not viable on a long run.

if (GrpName == 'G1'){
    dt1 <- rbind (dt1,tmp1) # merge data together}
  }
  if(GrpName == 'G2'){
    dt2 <- rbind (dt2,tmp1) # merge data together}
  }

  if(GrpName == 'G3'){
    dt3 <- rbind (dt3,tmp1) # merge data together}

  }

Any advise on were to go from here ?

JPC
  • 51
  • 1
  • 8

1 Answers1

0

Using some sample data:

dat <- read.table(stringsAsFactors=FALSE, header=TRUE, text="
G1S1 G1S2 G1S3 G2S1 G2S2 G2S3 G3S1 G3S2 G3S3
111  121  131  211  221  231  311  321  331
112  122  132  212  222  232  312  322  332
113  123  133  213  223  233  313  323  333")

Wanting to split solely on the second character of each name, let's first know how to break that out from the rest:

gsub("S.*", "", names(dat))
# [1] "G1" "G1" "G1" "G2" "G2" "G2" "G3" "G3" "G3"

Now what we want to do is extract the columns into these three groups. One way is to split them up. If we use split by itself, it doesn't work:

split(dat, gsub("S.*", "", names(dat)))
# Warning in split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) :
#   data length is not a multiple of split variable
# $G1
#   G1S1 G1S2 G1S3 G2S1 G2S2 G2S3 G3S1 G3S2 G3S3
# 1  111  121  131  211  221  231  311  321  331
# 2  112  122  132  212  222  232  312  322  332
# 3  113  123  133  213  223  233  313  323  333
# $G2
# [1] G1S1 G1S2 G1S3 G2S1 G2S2 G2S3 G3S1 G3S2 G3S3
# <0 rows> (or 0-length row.names)
# $G3
# [1] G1S1 G1S2 G1S3 G2S1 G2S2 G2S3 G3S1 G3S2 G3S3
# <0 rows> (or 0-length row.names)

This is because split sees that it is working on a data.frame and tries to do things by-row, which is not what we want. If you search around, there are a number of split S3 methods (functions that have specific versions based on the first argument) available:

methods("split")
# [1] split.data.frame split.Date       split.default    split.POSIXct   
# see '?methods' for accessing help and source code

The version we were silently using is split.data.frame, because dat is a frame. We can override that:

lodf <- split.default(dat, gsub("S.*", "", names(dat)))
lodf
# $G1
#   G1S1 G1S2 G1S3
# 1  111  121  131
# 2  112  122  132
# 3  113  123  133
# $G2
#   G2S1 G2S2 G2S3
# 1  211  221  231
# 2  212  222  232
# 3  213  223  233
# $G3
#   G3S1 G3S2 G3S3
# 1  311  321  331
# 2  312  322  332
# 3  313  323  333

From here: I personally recommend you keep them in this list-of-frames (ergo my temp name lodf) structure. This is made on the assumption that something you do to one you will also do (the same thing) to the others, in which case lapply is a natural choice for operations.

If you really need to break this down (again, discouraged), you can assign them as the group names to the calling environment with:

for (nm in names(lodf)) assign(nm, lodf[[nm]])
ls()
# [1]  "dat"   "G1"   "G2"   "G3" "lodf"

A related answer dealing with frames-within-lists is How do I make a list of data frames?


You might go one step further, perhaps to deal with this in a "tidy" (i.e., "long") format: incorporate the group into the frames themselves:

do.call(rbind.data.frame, c(mapply(function(nm, x) {
  # remove the 'G#' from each column name, allows row-binding later
  names(x) <- gsub(paste0("^", nm), "", names(x))
  # add the group name as a new column
  transform(x, Grp = nm)
}, names(lodf), lodf, SIMPLIFY = FALSE), list(stringsAsFactors = FALSE)))
#       S1  S2  S3 Grp
# G1.1 111 121 131  G1
# G1.2 112 122 132  G1
# G1.3 113 123 133  G1
# G2.1 211 221 231  G2
# G2.2 212 222 232  G2
# G2.3 213 223 233  G2
# G3.1 311 321 331  G3
# G3.2 312 322 332  G3
# G3.3 313 323 333  G3

(I don't like the row names G1.1 etc, but they're harmless.) Explanation:

  • mapply is similar to lapply except it takes 1 or more arguments per function-call, effectively "zipping" the list arguments together. The arguments to it are names(lodf) and lodf, so the first call (calling anonfunc the anonymous function taking (nm, x) as its arguments) looks like anonfunc(names(lodf)[[1]], lodf[[1]]), the second anonfunc(names(lodf)[[2]], lodf[[2]]), etc.
  • within that anonymous function, I first remove the G# from the column names. This allows us to row-bind them later as S1, S2, etc.
  • I add the group name to the frames as Grp.

This format allows you to do one thing once, applying grouping by the Grp variable as needed. If you're using dplyr or data.table, for instance, it's quite easy to do ... %>% dplyr::group_by(Grp) %>% ... or DT[, .(...), by="Grp"], respectively.

r2evans
  • 141,215
  • 6
  • 77
  • 149