
I'm dealing with huge .txt data frames generated from microscopy data. Each single .txt output file is about 3 to 4 GB, and I have a couple hundred of them....

Each of those monster files has a couple hundred features; some are categorical and some are numeric.

Here is an abstract example of the dataframe:

df <- read.csv("output.txt", sep="\t", skip = 9,header=TRUE, fill = T)
df

Row  Column stimulation Compound Concentration treatmentsum Pid_treatmentsum  var1 var2  var3  ...
1    1      uns         Drug1    3             uns_Drug1_3  Jack_uns_Drug1_3  15.0 20.2  3.568 ...
1    1      uns         Drug1    3             uns_Drug1_3  Jack_uns_Drug1_3  55.0 0.20  9.068
1    1      uns         Drug2    5             uns_Drug2_5  Jack_uns_Drug2_5  100  50.2  3.568
1    1      uns         Drug2    5             uns_Drug2_5  Jack_uns_Drug2_5  75.0 60.2  13.68
1    1      3S          Drug1    3             3s_Drug1_3   Jack_3s_Drug1_3   65.0 30.8  6.58
1    1      4S          Drug1    3             4s_Drug1_3   Jack_4s_Drug1_3   35.0 69.3  2.98
.....

And I would like to split the data frame based on a common value in a categorical column, treatmentsum, so that I have all cells treated with the same drug and the same dosage together, i.e. all "uns_Drug1_3" rows go to one output .txt.

I have seen similar posts, so I used split():

sptdf <- split(df, df$treatmentsum)

It worked: sptdf now gives me a list of data frames. Now I want to write them out as tables, and ideally I want to use the "Pid_treatmentsum" value as each split file's name, since all rows in a given split should share the exact same "Pid_treatmentsum". I don't quite know how to do that, so for now I can at least manually input the patient ID and join it on with paste():

lapply(names(sptdf), function(x){write.table(sptdf[[x]], file = paste("Jack", x, sep = "_"))}) 

This works in the sense that it writes out all the individual files with correct titles, but they have no .txt extension, and if I open them in Excel I get warnings that they are corrupted. Meanwhile in R I get the error message:
Error in file(file, ifelse(append, "a", "w")) : cannot open the connection

Where did I go wrong?

Given the sheer size of each microscope output file (3-4 GB), is this the best way to do this?

And if I can push this further: could I dump all hundreds of those huge files in a folder and write a loop to automate the process, instead of splitting one file at a time? The only problem I foresee is that the microscope output files always have the same name, "output".

Thank you in advance, and sorry for the long post.

Cheers, ML


1 Answer


I don't believe this is very different from the OP's code, but here it goes.

First, a test data set. I will use a copy of the built-in data set iris:

df <- iris
names(df)[5] <- "Pid_treatmentsum"

Now the file-writing code:

sptdf <- split(df, df$Pid_treatmentsum)
lapply(sptdf, function(DF){
  outfile <- as.character(unique(DF[["Pid_treatmentsum"]]))
  outfile <- paste0(outfile, ".txt")
  write.table(DF, 
              file = outfile,
              row.names = FALSE,
              quote = FALSE)
})

If Excel complains that the file is corrupt, maybe write.csv (and the file extension ".csv") will solve the problem.
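For reference, a minimal sketch of that CSV variant, assuming the same split list `sptdf` as above:

```r
# Write each split as a .csv file; Excel recognises the extension directly.
lapply(sptdf, function(DF){
  outfile <- as.character(unique(DF[["Pid_treatmentsum"]]))
  write.csv(DF,
            file = paste0(outfile, ".csv"),
            row.names = FALSE,
            quote = FALSE)
})
```

write.csv is just a wrapper around write.table with comma separators and a header row, so the rest of the logic is unchanged.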

Edit.

To automate the code above for processing many files, the split/lapply could be rewritten as a function. Then the function would be called with the filenames as an argument.
Something along the lines of (untested):

splitFun <- function(file, col = "Pid_treatmentsum", ...){
  X <- read.table(file, header = TRUE, ...)
  sptdf <- split(X, X[[col]])
  lapply(sptdf, function(DF){
    outfile <- as.character(unique(DF[[col]]))
    outfile <- paste0(outfile, ".txt")
    write.table(DF,
                file = outfile,
                row.names = FALSE,
                quote = FALSE)
  })
}


filenames <- list.files(pattern = "<a regular expression>")
lapply(filenames, splitFun)
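Since the microscope always names its files "output.txt", one option (a sketch, assuming each run sits in its own subfolder under a top-level "data" directory — adjust the path and pattern to your layout) is to list the files recursively and prefix each split's output name with its parent folder, so different runs don't overwrite each other:

```r
# Find every output.txt below the top folder, one per run subfolder
filenames <- list.files(path = "data", pattern = "^output\\.txt$",
                        recursive = TRUE, full.names = TRUE)

splitFun2 <- function(file, col = "Pid_treatmentsum", ...){
  X <- read.table(file, header = TRUE, ...)
  run <- basename(dirname(file))   # parent folder name identifies the run
  sptdf <- split(X, X[[col]])
  lapply(sptdf, function(DF){
    outfile <- paste0(run, "_", as.character(unique(DF[[col]])), ".txt")
    write.table(DF, file = outfile, row.names = FALSE, quote = FALSE)
  })
}

lapply(filenames, splitFun2)
```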
  • Thank you, this worked nicely. Sorry this might be less creative compared to the OP. I'm too new to programming. I'm learning from all you guys, and if I could ask a dummy question, could you explain the (unique(DF[["Pid_treatmentsum"]])) part? More specifically, what does the DF[[]] do to the string "Pid_treatmentsum"? Am I correct that you are trying to get a vector of all "Pid_treatmentsum" values from an already-split data frame, and unique will return a single value since all of them will have the same "Pid_treatmentsum" in any given split data frame? – ML33M Feb 18 '20 at 18:17
  • I hope I didn't lose you :( Could I read more about this somewhere? – ML33M Feb 18 '20 at 18:18
  • @ML33M I use `[[` because it's inside a function and when programming that's the recommended extractor, not `$`, which should be reserved for interactive mode. This post [Difference between `[` and `[[`](https://stackoverflow.com/questions/1169456/the-difference-between-bracket-and-double-bracket-for-accessing-the-el) has more details about it. – Rui Barradas Feb 18 '20 at 18:21
  • Thank you mate! This is really nice! For the cheeky question of doing hundreds of these files in the same directory, do I just simply loop through the folder? – ML33M Feb 18 '20 at 18:24
  • @ML33M See the edit, maybe it will give an idea of how it could be done. – Rui Barradas Feb 18 '20 at 18:35
  • Thank you. I have encountered problems with the code when I ran the full dataset (3 GB), using just the code to process one file (which worked fine on the 30 MB smaller sample). I have copied the error message: Error in file(file, ifelse(append, "a", "w")) : invalid 'description' argument In addition: Warning message: In if (file == "") file <- stdout() else if (is.character(file)) { : the condition has length > 1 and only the first element will be used – ML33M Feb 18 '20 at 19:01