-2

I'm starting to work with larger datasets and I need a way to automate the functions I already use. The framework I'm hoping to achieve would apply a loop within dplyr select() to identify control groups and the corresponding experimental groups within a large data frame, then perform the function(s), save the output, and finally repeat the loop for all samples in the dataset.

The data is complicated in that I'm often working with datasets that contain 3000+ samples, so writting out every sample in the column headers would defeat the efficiency of using a computer program.

The layout of the data is really simple. Each control group is labeled as "control", each experimental group is labeled as "exper". To pair each control to it's experimental group, I've added a numerical component such that control1 will correspond to exper1, control2 to exper2, etc. Finally, many experimental groups correspond to the same control so I've added an alphabetical component: exper1a and exper1b both correspond to control1.

Here is an example of a table header:

variable control1 control1 exper1a exper1a exper1b exper1b

EDIT ### please see the reproducible data provided by Pierre in the checked answer### Sorry for not providing data of my own.

I'd like the code to look for the first integer of "control", then find the corresponding exper with the same integer. Within the group of eprimental groups that correspond to control1 I'd like to identify the first alapbetic group, and save the resulting table. For example:

variable control1 control1 exper1a exper1a

Next, I'd like the code to repeat this function for all other pairs of exper goups with control1 (ie exper1b), and once all exper groups have been exhausted for control1, move onto the control2.., and repeat this until no more control groups remain.

Basically I just want a bunch of small tables made from one big table.

Here is what I've pieced together so far that works:

library(readr)
library(dplyr)
df<-read.csv("big_table.csv")

I know it's pathetic, but It's better than listing the 100+ failed attempts I've tried. here is the theoretical framework I imagined working:

# define i as a list of integers without a limit
# define n as a list alpahbetical characters, start at a, stop at z
for(i in 1:ncol(x)) {
    select(df, variable, contains(i))
    for(n in a:ncol(x)){
        selcect(df, variable, contains(n))
write.csv("controli_experin.csv")
# where i in controli is the variable "i", and "in" represents the variables "i" and "n" in experin.
}}

Hopefully this gives everyone an idea of what I'm trying to do. Ideally, I will be able to apply this code to large and small datasets as long as the column nomenclature follows the same pattern.

Thank you in advance for your help.

Phil_T
  • 942
  • 9
  • 27
  • It's bad practice to name two columns identically, e.g. `variable control1 control1 exper1 exper1`. If you call `df[,"control1"]`, how would you (or the cpu) know what column to expect? – Pierre L Dec 31 '15 at 18:26
  • A couple comments: (1) since we don't have your CSV, showing us the code you use to read it in is pretty pointless (especially as your question has nothing to do with reading in data. Rather, you should try to make your question [reproducible](http://stackoverflow.com/q/5963269/903061), in this case the easiest way is probably to provide code for simulating data. (2) You might want to read up on [tidy data](http://vita.had.co.nz/papers/tidy-data.pdf), you seem to have "groups" of columns, which would probably do better in long form than their current wide form. – Gregor Thomas Dec 31 '15 at 18:27
  • And (3) if you insist on using variable columns rather than converting to long form and just using `group_by`, there's a [very nice vignetted on standard and non-standard evaluation in dplyr](https://cran.r-project.org/web/packages/dplyr/vignettes/nse.html) that should be your starting point. – Gregor Thomas Dec 31 '15 at 18:28
  • Gregor - Assuming I transpose the table, and break the current column names such that I no longer have three variables stored in the name, how do I write a loop to help group data for all instances of my control and experimental groups into separate files? – Phil_T Dec 31 '15 at 19:08
  • Please invest some time in [making your question reproducible](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Not only will it convey what you're trying to do better, it will also make you think harder about your data. – Roman Luštrik Dec 31 '15 at 20:27
  • Sorry for not making my question reproducible. In truth, all I had was a massive .csv file with a header that I didn't know how to manipulate. The comments from Gregor were very helpful, and I was able to modify my table using tidyr, dyplyr, split, followed by a series lappy to achive my end goal. Thanks everyone for the help. – Phil_T Jan 01 '16 at 19:58

1 Answers1

1

Here's an approach with a reproducible example. Basically we look for the numbers in the column names and combine each set of numbers with the first column:

library(stringr)
nums <- as.integer(str_extract(names(df1)[-1], "\\d+"))
lapply(unique(nums), function(x) cbind(df1[1], df1[,c(F,nums == x)]))
# [[1]]
#      variable     ctrl1a     ctrl1b     expr1a      expr1b
# 1 -0.08051177  0.6160930 -0.1509403 -1.1601206  0.84550630
# 2  0.69528553  0.9835623 -0.2764147  0.9141205  1.74502652
# 3 -1.75956316 -0.4281811  0.4065561 -0.4831384 -0.56058428
# 4  0.36156427  1.1100194  1.2543866 -0.7526051  0.50255499
# 5  2.08312138 -0.9252150  1.4087865 -0.5825413 -0.12795336
# 6 -0.35242288  1.0341357  0.3862441  1.0471908 -0.45732123
# 7 -0.45336626  0.3514331 -0.7843920  0.6576382 -1.39281599
# 8  1.34009843 -0.4036598 -0.4233619 -0.1027439  0.04361829
# 9 -0.06922307 -1.6241054  0.6889851  0.9623857 -0.26017311
# 
# [[2]]
#      variable     ctrl2a      ctrl2b     expr2a     expr2b
# 1 -0.08051177 -0.5602536 -0.09769381 -0.5693422  0.5097039
# 2  0.69528553  2.4807524  0.67856805 -0.5307254 -0.8791350
# 3 -1.75956316  0.1682293 -0.05482072  0.1568936 -0.4955552
# 4  0.36156427 -0.2044310  0.80732824 -1.6120776  0.2843847
# 5  2.08312138  0.3807844  0.99740970 -1.6695074 -0.5224782
# 6 -0.35242288 -1.1126282  0.62537929 -0.6152809  1.2340156
# 7 -0.45336626 -2.6516954  1.74563891  0.5517721  1.6779573
# 8  1.34009843 -1.3851439 -0.62581919 -1.1300818 -0.1788827
# 9 -0.06922307 -1.0398341 -0.89502608 -1.1915519  0.3488837

Reproducible data

set.seed(1231)
df1 <- data.frame(matrix(rnorm(81), 9,9))
names(df1) <- c("variable", paste0(rep(c("ctrl", "expr"), 2, each=2), rep(1:2, each=4), c("a", "b")))
Pierre L
  • 28,203
  • 6
  • 47
  • 69