
I'm dealing with huge .txt data frames generated from microscopy data. Each single .txt output file is about 3 to 4 GB, and I have a couple hundred of them....

Each of those monster files has a couple hundred features; some are categorical and some are numeric.

Here is an abstract example of the dataframe:

df <- read.csv("output.txt", sep="\t", skip = 9,header=TRUE, fill = T)
df

Row  Column stimulation Compound Concentration treatmentsum Pid_treatmentsum  var1 var2  var3  ...
1    1      uns         Drug1    3             uns_Drug1_3  Jack_uns_Drug1_3  15.0 20.2  3.568 ...
1    1      uns         Drug1    3             uns_Drug1_3  Jack_uns_Drug1_3  55.0 0.20  9.068
1    1      uns         Drug2    5             uns_Drug2_5  Jack_uns_Drug2_5  100  50.2  3.568
1    1      uns         Drug2    5             uns_Drug2_5  Jack_uns_Drug2_5  75.0 60.2  13.68
1    1      3S          Drug1    3             3s_Drug1_3   Jack_3s_Drug1_3   65.0 30.8  6.58
1    1      4S          Drug1    3             4s_Drug1_3   Jack_4s_Drug1_3   35.0 69.3  2.98
.....

And I would like to split the data frame based on a common value in a categorical column, treatmentsum, so that I have all cells treated with the same drug and the same dosage together, i.e. all "uns_Drug1_3" rows go to one output .txt.

I have seen similar posts, so I used split():

sptdf <- split(df, df$treatmentsum)

It worked: sptdf now gives me a list of data frames. Now I want to write them out as tables, and ideally I want to use the "Pid_treatmentsum" value as each split file's name, since all rows in a given split should share the exact same "Pid_treatmentsum". I don't quite know how to do that, so for now I can at least manually input the patient ID and join it on with paste():

lapply(names(sptdf), function(x){write.table(sptdf[[x]], file = paste("Jack", x, sep = "_"))}) 

This works in the sense that it writes out all the individual files with correct titles, but they have no .txt extension, and if I open them in Excel I get warnings that they are corrupted. Meanwhile in R I get the error message:
Error in file(file, ifelse(append, "a", "w")) : cannot open the connection

Where did I go wrong?

Given the sheer size of each microscope output file (3-4 GB), is this the best way to do this?

And if I can push this further: could I dump all hundreds of those huge files in a folder and write a loop to automate the process, instead of splitting one file at a time? The only problem I foresee is that the microscope output files always have the same name, "output".

Thank you in advance, and sorry for the long post.

Cheers, ML


1 Answer


I don't believe this is very different from the OP's code, but here it goes.

First, a test data set. I will use a copy of the built-in data set iris:

df <- iris
names(df)[5] <- "Pid_treatmentsum"

Now the file-writing code:

sptdf <- split(df, df$Pid_treatmentsum)
lapply(sptdf, function(DF){
  outfile <- as.character(unique(DF[["Pid_treatmentsum"]]))
  outfile <- paste0(outfile, ".txt")
  write.table(DF, 
              file = outfile,
              row.names = FALSE,
              quote = FALSE)
})

If Excel complains that the file is corrupt, maybe write.csv (and the file extension ".csv") will solve the problem.
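For reference, a minimal sketch of that CSV variant, assuming the same split list `sptdf` as above:

```r
# Write each split as a .csv file; Excel recognises the extension directly.
lapply(sptdf, function(DF){
  outfile <- as.character(unique(DF[["Pid_treatmentsum"]]))
  write.csv(DF,
            file = paste0(outfile, ".csv"),
            row.names = FALSE,
            quote = FALSE)
})
```

write.csv is just a wrapper around write.table with comma separators and a header row, so the rest of the logic is unchanged.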

Edit.

To automate the code above for processing many files, the split/lapply could be rewritten as a function. Then the function would be called with the filenames as an argument.
Something along the lines of (untested):

splitFun <- function(file, col = "Pid_treatmentsum", ...){
  X <- read.table(file, header = TRUE, ...)
  sptdf <- split(X, X[[col]])
  lapply(sptdf, function(DF){
    outfile <- as.character(unique(DF[[col]]))
    outfile <- paste0(outfile, ".txt")
    write.table(DF,
                file = outfile,
                row.names = FALSE,
                quote = FALSE)
  })
}


filenames <- list.files(pattern = "<a regular expression>")
lapply(filenames, splitFun)
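Since the microscope always names its files "output.txt", one option (a sketch, assuming each run sits in its own subfolder under a top-level "data" directory — adjust the path and pattern to your layout) is to list the files recursively and prefix each split's output name with its parent folder, so different runs don't overwrite each other:

```r
# Find every output.txt below the top folder, one per run subfolder
filenames <- list.files(path = "data", pattern = "^output\\.txt$",
                        recursive = TRUE, full.names = TRUE)

splitFun2 <- function(file, col = "Pid_treatmentsum", ...){
  X <- read.table(file, header = TRUE, ...)
  run <- basename(dirname(file))   # parent folder name identifies the run
  sptdf <- split(X, X[[col]])
  lapply(sptdf, function(DF){
    outfile <- paste0(run, "_", as.character(unique(DF[[col]])), ".txt")
    write.table(DF, file = outfile, row.names = FALSE, quote = FALSE)
  })
}

lapply(filenames, splitFun2)
```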
  • Thank you, this worked nicely. Sorry this might be less creative compared to the OP. I'm too new to programming. I'm learning from all you guys, and if I could ask a dummy question, could you explain the (unique(DF[["Pid_treatmentsum"]])) part? More specifically, what does the DF[[]] do to the string "Pid_treatmentsum"? Am I correct that you are trying to get a vector of all "Pid_treatmentsum" values from an already-split data frame, and unique will return a single value since all of them will have the same "Pid_treatmentsum" in any given split data frame? – ML33M Feb 18 '20 at 18:17
  • I hope I didn't lose you :( Could I read more about this somewhere? – ML33M Feb 18 '20 at 18:18
  • @ML33M I use `[[` because it's inside a function and when programming that's the recommended extractor, not `$`, which should be reserved for interactive mode. This post [Difference between `[` and `[[`](https://stackoverflow.com/questions/1169456/the-difference-between-bracket-and-double-bracket-for-accessing-the-el) has more details about it. – Rui Barradas Feb 18 '20 at 18:21
  • Thank you mate! This is really nice! For the cheeky question of doing hundreds of these files in the same directory, do I just simply loop through the folder? – ML33M Feb 18 '20 at 18:24
  • @ML33M See the edit, maybe it will give an idea of how it could be done. – Rui Barradas Feb 18 '20 at 18:35
  • Thank you. I have encountered problems with the code when I ran the full dataset (3 GB), using just the code to process one file (which worked fine on the 30 MB smaller sample). I have copied the error message: Error in file(file, ifelse(append, "a", "w")) : invalid 'description' argument In addition: Warning message: In if (file == "") file <- stdout() else if (is.character(file)) { : the condition has length > 1 and only the first element will be used – ML33M Feb 18 '20 at 19:01