I'm working with huge tab-delimited .txt files generated from microscopy data. Each output file is about 3 to 4 GB, and I have a couple hundred of them.
Each of those monster files has a couple hundred features, some categorical and some numeric.
Here is a simplified example of the data frame:
df <- read.csv("output.txt", sep = "\t", skip = 9, header = TRUE, fill = TRUE)
df
Row Column stimulation Compound Concentration treatmentsum Pid_treatmentsum var1 var2 var3  ...
1   1      uns         Drug1    3             uns_Drug1_3  Jack_uns_Drug1_3 15.0 20.2 3.568 ...
1   1      uns         Drug1    3             uns_Drug1_3  Jack_uns_Drug1_3 55.0 0.20 9.068
1   1      uns         Drug2    5             uns_Drug2_5  Jack_uns_Drug2_5 100  50.2 3.568
1   1      uns         Drug2    5             uns_Drug2_5  Jack_uns_Drug2_5 75.0 60.2 13.68
1   1      3S          Drug1    3             3s_Drug3_3   Jack_3s_Drug1_3  65.0 30.8 6.58
1   1      4S          Drug1    3             4s_Drug3_3   Jack_4s_Drug1_3  35.0 69.3 2.98
.....
I would like to split the data frame on the values of a categorical column, treatmentsum, so that all cells treated with the same drug at the same dosage end up together. For example, all "uns_Drug1_3" rows should go to one output .txt file.
I have seen similar posts, so I used split():
sptdf <- split(df, df$treatmentsum)
It worked: sptdf is now a list of data frames. Next I want to write each of them out as a table, ideally using its "Pid_treatmentsum" value as the file name, since every row in a piece should have the exact same "Pid_treatmentsum" after splitting. I don't quite know how to do that, so for now I type in the patient ID manually and join it to the group name with paste():
lapply(names(sptdf), function(x){write.table(sptdf[[x]], file = paste("Jack", x, sep = "_"))})
This works in the sense that it writes out all the individual files with the correct names, but they have no .txt extension, and if I open them in Excel I get warnings that they are corrupted. Meanwhile, in R I get this error message:
Error in file(file, ifelse(append, "a", "w")) :
cannot open the connection
Where did I go wrong?
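To make the goal concrete, here is a minimal sketch of the output I am after, run on a toy stand-in for the real table. The values, the `out_dir` folder, and the `safe_name` helper are hypothetical and only for illustration. The helper is there because one possible cause of a "cannot open the connection" error is a file name the operating system rejects (for example one containing "/"); whether that applies to my data is only a guess.

```r
# Toy stand-in for the real data (hypothetical values).
df <- data.frame(
  treatmentsum     = c("uns_Drug1_3", "uns_Drug1_3", "uns_Drug2_5"),
  Pid_treatmentsum = c("Jack_uns_Drug1_3", "Jack_uns_Drug1_3", "Jack_uns_Drug2_5"),
  var1             = c(15, 55, 100),
  stringsAsFactors = FALSE
)

# Hypothetical output folder; created up front so file() has somewhere to write.
out_dir <- file.path(tempdir(), "split_output")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# Hypothetical helper: replace anything that is not a letter, digit, dot,
# underscore, or hyphen, as a precaution against invalid file names.
safe_name <- function(x) gsub("[^A-Za-z0-9._-]+", "_", x)

# drop = TRUE skips empty groups (e.g. unused factor levels).
sptdf <- split(df, df$treatmentsum, drop = TRUE)

written <- vapply(sptdf, function(piece) {
  # Every row in a piece shares one Pid_treatmentsum, so the first is enough.
  out_file <- file.path(out_dir, paste0(safe_name(piece$Pid_treatmentsum[1]), ".txt"))
  write.table(piece, file = out_file, sep = "\t", quote = FALSE, row.names = FALSE)
  out_file
}, character(1))
```

Writing with sep = "\t" and row.names = FALSE keeps the header aligned with the columns, so the files can be read back with read.delim().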
Given the sheer size of each output file by the microscope (3-4GB), is this the best way to do this?
And if I can push this further: can I dump all of those hundreds of huge files in a folder and write a loop to automate the process, instead of splitting one file at a time? The only problem I foresee is that the microscope output files always have the same name, "output".
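As a rough sketch of the kind of loop I have in mind: it assumes each microscope run is kept in its own subfolder (so the identically named output.txt files do not overwrite each other) and uses the folder name as a prefix to tell the runs apart. The folder layout and the function names `split_one_file` and `split_all_files` are hypothetical.

```r
# Hypothetical layout: <root>/<run_folder>/output.txt, one folder per run.
split_one_file <- function(in_file, out_dir, skip = 9) {
  df <- read.delim(in_file, skip = skip, header = TRUE, fill = TRUE,
                   stringsAsFactors = FALSE)
  run_id <- basename(dirname(in_file))  # the folder name tells the runs apart
  pieces <- split(df, df$treatmentsum, drop = TRUE)
  vapply(pieces, function(piece) {
    out_file <- file.path(out_dir,
                          paste0(run_id, "_", piece$Pid_treatmentsum[1], ".txt"))
    write.table(piece, file = out_file, sep = "\t", quote = FALSE,
                row.names = FALSE)
    out_file
  }, character(1))
}

split_all_files <- function(root, out_dir, skip = 9) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  # Find every file named exactly "output.txt" below root.
  in_files <- list.files(root, pattern = "^output\\.txt$", recursive = TRUE,
                         full.names = TRUE)
  # Process the inputs sequentially, so only one input file is loaded at a time.
  unlist(lapply(in_files, split_one_file, out_dir = out_dir, skip = skip),
         use.names = FALSE)
}
```

A call would look like split_all_files("path/to/runs", "path/to/split"). This sticks to base R; for files of this size, data.table::fread() and data.table::fwrite() may be worth trying in place of read.delim() and write.table().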
Thank you in advance, and sorry for the long post.
Cheers, ML