
I have a data frame [Data frame examples] with 113 rows and 54748 columns in total. The column headers look like this:

"SampleID" "metadata_1" "metadata_2" "metadata_3" "Gene_1" "Gene_2" ... "Gene_54748"

The goal is to randomly split the data frame by the "Gene_XXX" columns into 10 smaller data frames. Every subset must keep the same 4 initial columns, i.e. ["SampleID" "metadata_1" "metadata_2" "metadata_3"], plus a randomly selected set of "Gene_XXX" columns, with the gene columns distributed almost equally in number across the 10 subsets.

Example output:

Subset 1:
"SampleID" "metadata_1" "metadata_2" "metadata_3" "Gene_3" "Gene_8" "Gene_4"... "Gene_5474"

Subset 2:
"SampleID" "metadata_1" "metadata_2" "metadata_3" "Gene_1" "Gene_6" "Gene_5"... "Gene_5470"

......

Subset_10:
"SampleID" "metadata_1" "metadata_2" "metadata_3" "Gene_2" "Gene_7" "Gene_9"... "Gene_5472"

So every original "Gene" column should appear exactly once across the 10 subsets, and the columns should be randomly distributed (not in order of appearance or alphabetically).

Any idea on how to perform this?

Thank you in advance for any feedback!

Parfait
GiorgioC

1 Answer


Consider randomly sampling the column indexes, which would be 5:54748 (54,744 gene columns), and then splitting them into chunks of 5,475 items to get 10 subsets (the last one holds the remainder, 5,469 columns).

Then run the chunks through lapply to build a single list of 10 data frames, each holding the four leading columns plus up to 5,475 gene columns, extracted by column index from the original gene_df. Every gene column is used exactly once, in random order, across the subsets.

set.seed(31122)            # SET SEED TO REPRODUCE RANDOMIZATION
cols <- sample(5:54748)    # SAMPLE WITHOUT REPLACEMENT (DEFAULT)

# CHUNKS OF 5,475 GIVE EXACTLY 10 GROUPS (5,474 WOULD LEAVE AN 11TH GROUP OF 4)
splits_by_10 <- split(cols, ceiling(seq_along(cols)/5475))

# CREATE NAMED LIST OF DATA FRAMES
sample_dfs <- lapply(
    splits_by_10, function(cols) gene_df[,c(1:4, cols)]
) |> setNames(             # USES NEW PIPE, |> FOR R v4.1.0+
    paste0("subset_", seq_along(splits_by_10))
)

sample_dfs$subset_1
sample_dfs$subset_2
sample_dfs$subset_3
...
sample_dfs$subset_10
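
To sanity-check the approach without the real data, here is a minimal sketch on a small made-up data frame. The names toy_df, fixed_cols, groups and toy_subsets are hypothetical and not part of the original question. Instead of fixed-size chunks it deals the shuffled columns into the 10 groups in turn with rep_len, which is a swapped-in variant that keeps subset sizes within one column of each other.

```r
# Toy stand-in for gene_df (hypothetical data): 4 leading columns + 23 gene columns
set.seed(31122)
n_genes <- 23
toy_df <- data.frame(
  SampleID   = paste0("S", 1:5),
  metadata_1 = letters[1:5],
  metadata_2 = LETTERS[1:5],
  metadata_3 = 1:5,
  matrix(rnorm(5 * n_genes), nrow = 5,
         dimnames = list(NULL, paste0("Gene_", seq_len(n_genes))))
)

fixed_cols <- 1:4
n_subsets  <- 10

# Shuffle the gene column indexes (everything except the 4 leading columns)
gene_cols <- sample(setdiff(seq_along(toy_df), fixed_cols))

# Assign shuffled columns to groups 1..10 in turn, so sizes differ by at most one
groups <- rep_len(seq_len(n_subsets), length(gene_cols))
splits <- split(gene_cols, groups)

# Build the named list of subsets, each keeping the 4 leading columns
toy_subsets <- lapply(splits, function(cols) toy_df[, c(fixed_cols, cols)])
names(toy_subsets) <- paste0("subset_", seq_along(toy_subsets))

# Collect the gene column names found across all subsets
all_genes <- unlist(lapply(toy_subsets, function(d) names(d)[-fixed_cols]),
                    use.names = FALSE)

# Every gene column appears exactly once, and subset sizes are balanced
stopifnot(
  length(toy_subsets) == n_subsets,
  anyDuplicated(all_genes) == 0,
  setequal(all_genes, names(toy_df)[-fixed_cols]),
  diff(range(lengths(splits))) <= 1
)
```

The same checks can be run on sample_dfs by swapping in gene_df, which is a quick way to confirm that the number of subsets is really 10 and that no gene column was dropped or duplicated.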
Parfait