
I would like to split one data frame into two using R: for example, one data frame holding 70% of the original rows and the other holding the remaining 30%. How could I do that? My data frame has dimensions (22740, 2).

My data frame has one column of genes and a second column with the pathway each gene belongs to. I want to keep that 70/30 ratio in EVERY pathway of the data frame. Therefore, I am not interested in simply taking the first 70% of the rows to make a new data frame, for example.

Hope I explained myself clearly.

Jaap
Maik

2 Answers


Using dplyr: df2 is the 70% and df3 is the 30%. The ref column indexes the original rows, so the rows that were not sampled can be recovered afterwards. The group_by ensures that each pathway is sampled individually.

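library(dplyr)
# ref stores each row's original position; sampling is then done within each pathway
df2 <- df %>% mutate(ref = seq_len(nrow(df))) %>% group_by(pathway) %>% sample_frac(0.7)
# df3 keeps every row of df that was not drawn into df2
df3 <- df[-df2$ref, ]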
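
To see whether the split really keeps the ratio inside each pathway, a small check on toy data can help. This is only a sketch: the toy data frame, its gene names and the pathway sizes are made up for illustration, and set.seed is there just to make the example repeatable.

```r
library(dplyr)

set.seed(1)  # only so the toy example is repeatable

# Hypothetical toy data: 3 pathways with 10, 20 and 30 genes
df <- data.frame(
  gene    = paste0("g", 1:60),
  pathway = rep(c("A", "B", "C"), times = c(10, 20, 30)),
  stringsAsFactors = FALSE
)

# Same split as above: sample 70% within each pathway, rest goes to df3
df2 <- df %>%
  mutate(ref = seq_len(nrow(df))) %>%
  group_by(pathway) %>%
  sample_frac(0.7) %>%
  ungroup()
df3 <- df[-df2$ref, ]

# Rows per pathway in the original data and in each part
counts <- data.frame(
  pathway = names(table(df$pathway)),
  total   = as.vector(table(df$pathway)),
  in_70   = as.vector(table(df2$pathway)),
  in_30   = as.vector(table(df3$pathway))
)
print(counts)
```

Each pathway's in_70 and in_30 should add up to its total, with in_70 close to 70% of it. For pathways with very few genes the share can only be approximate, because the number of sampled rows has to be a whole number.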
Andrew Gustar

If you want a random selection of 30% of the rows, you can do:

   # Select 30% of the row indices at random (nrow(Tab) is 22740 here)
     Sel.ID <- sample(seq_len(nrow(Tab)), size = floor(0.3 * nrow(Tab)), replace = FALSE)
   # The new table with 30% of the rows would be . . .
     New.Tab.30 <- Tab[Sel.ID, ]
   # The table with the remaining 70% of the rows would be . . .
     New.Tab.70 <- Tab[-Sel.ID, ]

You can run this several times and get a different pair of tables each time. If you want the same split on every run, call set.seed() with a fixed value, for example set.seed(12345), before the first line.
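
Note that the code above samples from the whole table, so the 70/30 ratio is not guaranteed inside each pathway. A possible base R variant that samples within each pathway is sketched below. It assumes the pathway column is named pathway (the column names are not given in the question), and the small Tab built here is a hypothetical stand-in for the real data.

```r
set.seed(12345)  # fix the random draw so the split is repeatable

# Hypothetical stand-in for Tab: one gene column, one pathway column
Tab <- data.frame(
  gene    = paste0("g", 1:50),
  pathway = rep(c("P1", "P2"), times = c(20, 30)),
  stringsAsFactors = FALSE
)

# Row indices of Tab, grouped by pathway
idx_by_pathway <- split(seq_len(nrow(Tab)), Tab$pathway)

# Draw (about) 30% of the row indices within each pathway
Sel.ID <- unlist(lapply(idx_by_pathway, function(idx) {
  idx[sample.int(length(idx), size = round(0.3 * length(idx)))]
}), use.names = FALSE)

New.Tab.30 <- Tab[Sel.ID, ]
# setdiff() rather than -Sel.ID, so an empty selection still leaves all rows
New.Tab.70 <- Tab[setdiff(seq_len(nrow(Tab)), Sel.ID), ]
```

Indexing with idx[sample.int(...)] instead of sample(idx, ...) avoids the surprise that sample() treats a single number n as the range 1:n, which matters for pathways with only one gene.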

R18