I have a folder called simulations that contains 100 sub-folders, each holding the results of one simulation. The results in each sub-folder are split across four files named seq[1].nex, seq[2].nex, seq[3].nex, and seq[4].nex. Each of these files has the same format, which is as follows:

#NEXUS

Begin data;
Dimensions ntax=5 nchar=55;
Format datatype=Standard symbols="01" missing=? gap=-;
Matrix
L1   1100110010010100010110000110000010000100001011010010110
L2   1101110110011010010000010111000010010000001001010110110
L3   0111111100010100010011000001100011010100010010110011110
L4   1101110110011010010000010111000010010000001001010110110
L5   1101110100110100010110010110001010010100001011010110100
;
End;

The files named seq have the same number of rows (i.e., L1-L5), but they differ in the length of each row. For instance, seq[2].nex looks as follows:

#NEXUS

Begin data;
Dimensions ntax=5 nchar=20;
Format datatype=Standard symbols="012" missing=? gap=-;
Matrix
L1   10000012202011210001
L2   10002112212010210012
L3   10002112212210220022
L4   10002112212010220012
L5   10001112212010222012 
;
End;

For each of the 100 sub-folders, I want to merge seq[1].nex, seq[2].nex, seq[3].nex, and seq[4].nex into one file seq.nex. Starting with seq[1].nex, I want to append the information from the later files (i.e., 2-4) to its corresponding row in the first file. Using the two examples above, the output that I want would look like this:

#NEXUS

Begin data;
Dimensions ntax=5 nchar=55;
Format datatype=Standard symbols="01" missing=? gap=-;
Matrix
L1   110011001001010001011000011000001000010000101101001011010000012202011210001
L2   110111011001101001000001011100001001000000100101011011010002112212010210012
L3   011111110001010001001100000110001101010001001011001111010002112212210220022
L4   110111011001101001000001011100001001000000100101011011010002112212010220012
L5   110111010011010001011001011000101001010000101101011010010001112212010222012
;
End;

I then want to repeat this merging process for each of the 100 sub-folders. Is there a way to do this in R?

Namenlos
  • That looks to be a rather specific file format, but since it's consistent, I suggest: (1) figure out how to reliably load in _one_ file; (2) read [list of frames](https://stackoverflow.com/a/24376207/3358227) for ways to repeat that for a number of files; then (3) figure out how to use `Reduce(..)` on that, or come back here with a little more elbow-grease and we can help with that last step. – r2evans Mar 30 '23 at 20:32

1 Answer

Here is one approach:

library(data.table)

# get path to simulations folder
pth_to_simulations = "simulations"

# get a list of all subfolders, with full names
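fldrs = dir(pth_to_simulations, full.names=TRUE)

# Create a function that ingests a subfolder, reads files, and concatenates
read_sims <- function(fldr) {
  sims = dir(fldr, full.names=TRUE)
  # read the five matrix rows as text, so leading zeros are kept
  sims = lapply(sims, fread, skip=6, nrows=5, header=FALSE, colClasses="character")
  # stack the files in order; merge() joins only two tables at a time,
  # so do.call(merge, ...) fails when a folder holds more than two files
  sims = rbindlist(sims)
  # paste the pieces for each row label together
  sims[, .(V2 = paste0(V2, collapse="")), by=V1]
}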

# Apply the function to each of the fldrs in `simulations`
lapply(fldrs, read_sims)

If your example files are in simulations/sim1, then the result is as follows:

[[1]]
   V1                                                                          V2
1: L1 110011001001010001011000011000001000010000101101001011010000012202011210001
2: L2 110111011001101001000001011100001001000000100101011011010002112212010210012
3: L3 011111110001010001001100000110001101010001001011001111010002112212210220022
4: L4 110111011001101001000001011100001001000000100101011011010002112212010220012
5: L5 110111010011010001011001011000101001010000101101011010010001112212010222012

This output is a list of length 1, because there is only one folder (`sim1`). Your output would be a list of length 100, with each element containing the concatenated rows for one sub-folder.
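
To get from that list to an actual seq.nex in each sub-folder, the merged rows still need to be written back out under a NEXUS header. Below is a minimal base-R sketch of one way to do that. It assumes every input file has the layout shown in the question (matrix rows between `Matrix` and `;`), that every file uses the same row labels, and that `nchar` and `symbols` should be recomputed for the merged matrix. `merge_nexus()` is a hypothetical helper name, not a function from any package.

```r
# Hypothetical helper: merge seq[1].nex ... seq[k].nex in one folder into seq.nex
merge_nexus <- function(fldr, out = "seq.nex") {
  files <- sort(list.files(fldr, pattern = "^seq\\[[0-9]+\\]\\.nex$",
                           full.names = TRUE))

  # Read the rows between "Matrix" and ";" as a character vector named by label
  read_rows <- function(f) {
    x <- trimws(readLines(f))
    first <- which(tolower(x) == "matrix")[1] + 1
    last <- which(x == ";")[1] - 1
    parts <- strsplit(x[first:last], "\\s+")
    setNames(vapply(parts, `[`, "", 2), vapply(parts, `[`, "", 1))
  }

  rows <- lapply(files, read_rows)
  taxa <- names(rows[[1]])

  # Append each file's sequence to the matching row label, in file order
  merged <- vapply(taxa, function(tx) {
    paste0(vapply(rows, `[[`, "", tx), collapse = "")
  }, "")

  # Assumption: the symbol list is every state observed, minus missing/gap marks
  states <- unique(strsplit(paste(merged, collapse = ""), "")[[1]])
  symbols <- paste(sort(setdiff(states, c("?", "-"))), collapse = "")

  writeLines(c(
    "#NEXUS", "", "Begin data;",
    sprintf("Dimensions ntax=%d nchar=%d;", length(taxa), nchar(merged[[1]])),
    sprintf("Format datatype=Standard symbols=\"%s\" missing=? gap=-;", symbols),
    "Matrix",
    sprintf("%s   %s", taxa, merged),
    ";", "End;"
  ), file.path(fldr, out))

  invisible(merged)
}
```

Under those assumptions, something like `invisible(lapply(list.dirs("simulations", recursive = FALSE), merge_nexus))` would write one seq.nex per sub-folder. The plain `sort()` on file names is only safe while there are fewer than ten files per folder.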

langtang
  • Thank you very much! When I try to run this code, I get the following error message: `Error in if (!sort %in% c(TRUE, FALSE)) stop("Argument 'sort' should be logical TRUE/FALSE") : the condition has length > 1.` I'm not sure what the problem is. – Namenlos Mar 31 '23 at 07:47
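
An error of this kind is what one would expect when `merge()` is handed more than two tables in a single call. It joins exactly two tables, so any extra tables passed through `do.call()` are matched to its other arguments instead. A hedged sketch of the pairwise alternative with `Reduce()`, using made-up toy tables in place of the four files (the `seq1`...`seq4` column names are illustrative only):

```r
# Toy tables standing in for the rows read from four files (made-up values)
tabs <- list(
  data.frame(V1 = c("L1", "L2"), V2 = c("110", "011")),
  data.frame(V1 = c("L1", "L2"), V2 = c("12", "20")),
  data.frame(V1 = c("L1", "L2"), V2 = c("0", "1")),
  data.frame(V1 = c("L1", "L2"), V2 = c("22", "00"))
)

# Give each sequence column a unique name so repeated merges do not clash
tabs <- Map(function(d, i) setNames(d, c("V1", paste0("seq", i))),
            tabs, seq_along(tabs))

# Fold the list two tables at a time, joining on the row label
wide <- Reduce(function(x, y) merge(x, y, by = "V1"), tabs)

# Paste the sequence columns together row by row
merged <- data.frame(V1 = wide$V1,
                     V2 = do.call(paste0, unname(as.list(wide[-1]))))
```

Renaming the columns before merging avoids relying on `merge()`'s automatic `.x`/`.y` suffixes, which become ambiguous once more than two tables share a column name.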