I'm starting to work with larger datasets and I need a way to automate the functions I already use. The framework I'm hoping to achieve would apply a loop within dplyr select() to identify control groups and the corresponding experimental groups within a large data frame, then perform the function(s), save the output, and finally repeat the loop for all samples in the dataset.
The data is complicated in that I'm often working with datasets that contain 3000+ samples, so writting out every sample in the column headers would defeat the efficiency of using a computer program.
The layout of the data is really simple. Each control group is labeled as "control", each experimental group is labeled as "exper". To pair each control to it's experimental group, I've added a numerical component such that control1 will correspond to exper1, control2 to exper2, etc. Finally, many experimental groups correspond to the same control so I've added an alphabetical component: exper1a and exper1b both correspond to control1.
Here is an example of a table header:
variable control1 control1 exper1a exper1a exper1b exper1b
EDIT ### please see the reproducible data provided by Pierre in the checked answer### Sorry for not providing data of my own.
I'd like the code to look for the first integer of "control", then find the corresponding exper with the same integer. Within the group of eprimental groups that correspond to control1 I'd like to identify the first alapbetic group, and save the resulting table. For example:
variable control1 control1 exper1a exper1a
Next, I'd like the code to repeat this function for all other pairs of exper goups with control1 (ie exper1b), and once all exper groups have been exhausted for control1, move onto the control2.., and repeat this until no more control groups remain.
Basically I just want a bunch of small tables made from one big table.
Here is what I've pieced together so far that works:
library(readr)
library(dplyr)
df<-read.csv("big_table.csv")
I know it's pathetic, but It's better than listing the 100+ failed attempts I've tried. here is the theoretical framework I imagined working:
# define i as a list of integers without a limit
# define n as a list alpahbetical characters, start at a, stop at z
for(i in 1:ncol(x)) {
select(df, variable, contains(i))
for(n in a:ncol(x)){
selcect(df, variable, contains(n))
write.csv("controli_experin.csv")
# where i in controli is the variable "i", and "in" represents the variables "i" and "n" in experin.
}}
Hopefully this gives everyone an idea of what I'm trying to do. Ideally, I will be able to apply this code to large and small datasets as long as the column nomenclature follows the same pattern.
Thank you in advance for your help.