I have a large file named Objects_Population - AllCells.txt (~3 GB) with 25,704,373 rows and 132 variables. I want to read the file and split the rows by one variable, the column named treatmentsum. This column holds experimental drug treatments under different conditions (3S or UNS), stored as strings joined with "_". The split should put all rows with the same treatment together. After splitting, I want to write each group to its own file, named after its treatmentsum value.

My code is below:

#load libraries
library(tidyverse)
library(vroom)
library(dplyr)
library(stringr)

#read in the file, skip the first 9 rows
files<-vroom("Objects_Population - AllCells.txt", delim = "\t",skip = 9,col_names = T)

#split the files based on treatmentsum
splited<- files %>% 
  group_split(files$treatmentsum)

#write out the splitted files
output<- lapply(splited, function(i){
  for (i in 1:length(splited)) {
    write.table(splited[[i]][,1:131],file=paste(unique(splited[[i]]$treatmentsum),".txt"), sep="\t", row.names=FALSE)

  }
 })

When I run it, the file is read correctly and the split works as expected: I get a list of 1092 elements (shown in the environment), each containing the rows for one treatment. However, the code dies every time after writing 233 files. I have taken a screenshot of the error. All the files generated are 3S; no UNS files are generated (as you can see in the file directory at the bottom right of the screenshot). Can someone help me with this and let me know what the error means?

ML33M
  • 341
  • 2
  • 19
  • Try printing out the value of `paste(unique(splited[[i]]$treatmentsum),".txt")` in your loop. Are you sure you are supplying a valid file name each time? It's very difficult to help without any sort of [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – MrFlick Sep 23 '20 at 19:39
  • @MrFlick Hi, sorry for the sloppy work. I didn't know how to present the full file, as a minimal sample might not include the problem. I'm trying to subsample the file and see if things work. I did what you suggested and printed out the names: 1092 different names are printed, as expected. – ML33M Sep 23 '20 at 19:56
  • @MrFlick Still not working. Hmm, is it possible to message you directly or send you a subsampled file to test? – ML33M Sep 23 '20 at 20:19
  • @MrFlick Actually, I think I got it, hear me out. Some of the new txt file names contain "/" (thanks to biology), which, from what I read, might cause problems when writing. So I used gsub to remove the "/", and it's now writing beyond 233 files. – ML33M Sep 23 '20 at 20:57
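
The check suggested in the comments can be taken one step further by flagging the names that contain a path separator, rather than reading through all 1092 by eye. This is only a minimal sketch: the treatment labels below are made up and stand in for `unique(files$treatmentsum)`.

```r
# Hypothetical treatment labels standing in for unique(files$treatmentsum)
treatments <- c("3S_DrugA", "3S_DrugB/C", "UNS_DrugA", "UNS_DrugD/E")

# Build the file names the same way the question does.
# Note that paste() joins with a space by default, so each name ends in " .txt".
file_names <- paste(treatments, ".txt")

# Flag names containing "/" or "\", which the OS reads as directory separators
bad <- grepl("[/\\\\]", file_names)
print(file_names[bad])
```

A name such as "3S_DrugB/C .txt" is interpreted as a file called "C .txt" inside a directory called "3S_DrugB"; if that directory does not exist, the file cannot be opened for writing.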

1 Answer

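I figured out that some of the file names contain "/" because the treatment names do. Inspired by this answer, https://stackoverflow.com/a/49647853/12362355, I removed the "/" with gsub before writing (and switched from paste to paste0 so no space is added before ".txt"):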

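library(tidyverse)
library(vroom)
library(dplyr)
library(stringr)

# read in the file, skipping the first 9 rows
files <- vroom("Objects_Population - AllCells.txt", delim = "\t", skip = 9, col_names = TRUE)

# split the rows by treatment
splited <- files %>%
  group_split(files$treatmentsum)

# write one file per treatment, removing "/" from the treatment name;
# a single loop is enough (wrapping it in lapply rewrote every file once per group)
for (i in seq_along(splited)) {
  write.table(splited[[i]][, 1:131],
              file = paste0(gsub("/", "", unique(splited[[i]]$treatmentsum)), ".txt"),
              sep = "\t", row.names = FALSE)
}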
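
For anyone who wants to try the idea without the 3 GB file, here is a small self-contained sketch on toy data. It is an illustration under assumptions, not the exact code above: the `area` column and the treatment labels are made up, it uses base R `split()` instead of `group_split()`, and it replaces every character outside a conservative whitelist with "_" rather than deleting only "/".

```r
# Toy data standing in for the real table; only treatmentsum comes from the question
cells <- data.frame(
  treatmentsum = c("3S_DrugA", "3S_DrugB/C", "UNS_DrugA", "3S_DrugA"),
  area = c(10.5, 11.2, 9.8, 10.1),
  stringsAsFactors = FALSE
)

# Write into a scratch directory so nothing lands in the working directory
out_dir <- file.path(tempdir(), "split_by_treatment")
dir.create(out_dir, showWarnings = FALSE)

# split() returns a list whose element names are the treatment labels
groups <- split(cells, cells$treatmentsum)

for (treatment in names(groups)) {
  # Replace anything that is not a letter, digit, dot, underscore or hyphen
  safe_name <- gsub("[^A-Za-z0-9._-]", "_", treatment)
  write.table(groups[[treatment]],
              file = file.path(out_dir, paste0(safe_name, ".txt")),
              sep = "\t", row.names = FALSE, quote = FALSE)
}

written <- list.files(out_dir)
print(written)
```

One caveat with any renaming scheme: two different treatments can end up with the same sanitized name (for example "A/B" and "A_B" here), in which case the second file overwrites the first. Checking `anyDuplicated()` on the sanitized names before writing would catch that.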
ML33M