How can I use the concatenate command to automate the merging of several files?

Question

I need to merge two sets of files together. I don't mind what language is used as long as it works. For example, I have a folder containing both sets of files:

N-4A-2A_S135_L001_R1_001.fastq.gz
N-4A-2A_S135_L001_R2_001.fastq.gz
N-4A-2A_S135_L001_R1_001-2.fastq.gz
N-4A-2A_S135_L001_R2_001-2.fastq.gz

N-4-Bmp-1_S20_L001_R1_001.fastq.gz
N-4-Bmp-1_S20_L001_R2_001.fastq.gz
N-4-Bmp-1_S20_L001_R1_001-2.fastq.gz
N-4-Bmp-1_S20_L001_R2_001-2.fastq.gz

I need them merged like this:

N-4A-2A_S135_L001_R1_001.fastq.gz + N-4A-2A_S135_L001_R1_001-2.fastq.gz > N-4A-2A_S135_L001_R1_001-cat.fastq.gz
N-4A-2A_S135_L001_R2_001.fastq.gz + N-4A-2A_S135_L001_R2_001-2.fastq.gz > N-4A-2A_S135_L001_R2_001-cat.fastq.gz
N-4-Bmp-1_S20_L001_R1_001.fastq.gz + N-4-Bmp-1_S20_L001_R1_001-2.fastq.gz > N-4-Bmp-1_S20_L001_R1_001-cat.fastq.gz
N-4-Bmp-1_S20_L001_R2_001.fastq.gz + N-4-Bmp-1_S20_L001_R2_001-2.fastq.gz > N-4-Bmp-1_S20_L001_R2_001-cat.fastq.gz

I have a total of ~180 files so any way to automate this would be amazing. They all follow the same format of:

(sample name) + "R1_001" or "R2_001" + "" or "-2" + .fastq.gz

The easiest way to do this is with a shell script. Please re-tag accordingly. — Eugene Sh., Apr 12 '21 at 13:19
Are you sure it will work to just concatenate them? It's binary files with headers. — klutt, Apr 12 '21 at 13:21
Yeah, there are no headers in the file. I just need the equivalent of cut + paste the contents of one at the end of the other in a new file. — ertictus, Apr 12 '21 at 13:23
https://stackoverflow.com/questions/10347278/how-can-i-copy-several-binary-files-into-one-file-on-a-linux-system — klutt, Apr 12 '21 at 13:24
The cat command works on them individually for what I need; I just need help automating it because I have so many files. Keeping the original sample names for the new file is what I'm stuck on. — ertictus, Apr 12 '21 at 13:29
I've previously been taught to loop with for i in {1..180}, but because the names aren't in sequence I don't know how to approach it. — ertictus, Apr 12 '21 at 13:31
Use a wildcard that just matches one set of files. Loop over the results of that, convert the filename to the corresponding second file, then concatenate them. — Barmar, Apr 12 '21 at 15:27
What is the "joining" criteria? You have `N-4A-2A_S135_*` and `N-4-Bmp-1_S20_*`. So, should they be joined on a match of the `S135` vs `S20` subparts? Does the `L001` ever change? Is this a one time merge? Or, will you need to use the program repeatedly over the course of time? — Craig Estey, Apr 12 '21 at 15:37
Does this answer your question? [How can I copy several binary files into one file on a Linux system?](https://stackoverflow.com/questions/10347278/how-can-i-copy-several-binary-files-into-one-file-on-a-linux-system) — bad_coder, Apr 12 '21 at 15:41
AFAIK (AFAICT), you will need to decompress each member of the pair of files, concatenate the decompressed files, and then (re)compress the result. Use a shell script; it's going to be easier. — Jonathan Leffler, Apr 12 '21 at 16:49
@ertictus: If **any** language is fine for you, choose which ever you are most familiar with, and post your own attempts to solve the problem. You just describe what you want to achieve, but I don't see at which point you are stuck. — user1934428, Apr 13 '21 at 06:50
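
A minimal sketch of the approach Barmar describes in the comments: match only the first set of files with a wildcard, derive the name of the second file from each match, and concatenate. The function name `merge_pairs` and its `dry`/`go` argument are made up for illustration, and the glob assumes every first-set file ends in `_R1_001.fastq.gz` or `_R2_001.fastq.gz` as in the question.

```shell
#!/bin/bash
# merge_pairs DIR [go]
# Hypothetical helper: pairs "<name>.fastq.gz" with "<name>-2.fastq.gz"
# and writes "<name>-cat.fastq.gz". Without "go" it only prints the commands.
merge_pairs() {
  local dir=${1:-.} mode=${2:-dry} first second out
  # The glob matches only first-set files: names ending in "-2.fastq.gz"
  # or "-cat.fastq.gz" do not end in "_001.fastq.gz".
  for first in "$dir"/*_R[12]_001.fastq.gz; do
    [ -e "$first" ] || continue                 # glob matched nothing
    second="${first%.fastq.gz}-2.fastq.gz"      # derive the partner name
    out="${first%.fastq.gz}-cat.fastq.gz"       # derive the output name
    if [ ! -f "$second" ]; then
      echo "skipping $first: no $second" >&2
      continue
    fi
    echo "cat $first $second > $out"
    if [ "$mode" = go ]; then
      cat "$first" "$second" > "$out"
    fi
  done
}
```

Calling `merge_pairs .` prints what would be done; `merge_pairs . go` performs the merge. Because the second name is derived from the first, the result does not depend on sort order.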

rcwnd_cz · Answer 1 · 2021-04-13T10:15:44.753

1

According to the example you provided, your file names look sortable. If that is not the case, this answer will not work.

The idea behind the following snippet is simple: sort all the file names, then join each adjacent pair.

This should work in bash:

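#!/bin/bash
# Collect only the input FASTQ files (skip this script and earlier *-cat outputs).
# In the C locale '-' sorts before '.', so with -r each "X.fastq.gz"
# is immediately followed by its "X-2.fastq.gz".
# Assumes file names contain no whitespace.
files=( $( find . -maxdepth 1 -type f -name '*.fastq.gz' ! -name '*-cat.fastq.gz' | LC_ALL=C sort -r ) )
files_num=${#files[@]}
for (( i=0 ; i < files_num ; i=i+2 )); do
  f1="${files[i]}"
  f2="${files[i+1]}"
  # Output name in the form requested in the question: X-cat.fastq.gz
  fname="${f1%.fastq.gz}-cat.fastq.gz"
  echo "$f1 and $f2 -> $fname"
  cat "$f1" "$f2" > "$fname"
done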

Or a simpler version in Z shell (zsh), which can take two list elements per iteration:

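#!/bin/zsh
# Same file selection and ordering as the bash version.
files=( $( find . -maxdepth 1 -type f -name '*.fastq.gz' ! -name '*-cat.fastq.gz' | LC_ALL=C sort -r ) )
for f1 f2 in $files; do
  # Output name in the form requested in the question: X-cat.fastq.gz
  fname="${f1%.fastq.gz}-cat.fastq.gz"
  echo "$f1 and $f2 -> $fname"
  cat "$f1" "$f2" > "$fname"
done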
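
Since joining adjacent entries of a sorted list silently mispairs everything after a missing file, it may be worth checking first that every file has its partner. The following is only a sketch: `check_pairs` is a hypothetical helper, and it assumes the `-2` and `-cat` suffixes used in the question.

```shell
#!/bin/bash
# check_pairs DIR
# Hypothetical helper: returns non-zero and lists every *.fastq.gz file
# in DIR whose partner ("X.fastq.gz" <-> "X-2.fastq.gz") is missing.
check_pairs() {
  local dir=${1:-.} f partner status=0
  for f in "$dir"/*.fastq.gz; do
    [ -e "$f" ] || continue                     # glob matched nothing
    case $f in
      *-cat.fastq.gz) continue ;;               # earlier merge output, ignore
      *-2.fastq.gz)   partner="${f%-2.fastq.gz}.fastq.gz" ;;
      *)              partner="${f%.fastq.gz}-2.fastq.gz" ;;
    esac
    if [ ! -f "$partner" ]; then
      echo "unpaired: $f" >&2
      status=1
    fi
  done
  return "$status"
}
```

For example, `check_pairs . && ./merge.sh` would run the merge only when nothing is unpaired (`merge.sh` stands for whichever script above you saved).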

edited Apr 13 '21 at 10:15

answered Apr 12 '21 at 15:57


Which shell supports `for f1 f2 in …`? It isn't Bash, AFAIK (at least, not the ancient versions of Bash I have on hand). – Jonathan Leffler Apr 12 '21 at 16:47
Oh, good point. I got used to zsh too much I guess! – rcwnd_cz Apr 13 '21 at 10:08

score 0 · Answer 2 · answered Apr 12 '21 at 16:27

You did say any language, so ... I prefer Perl:

#!/usr/bin/perl
# concatenate files

master(@ARGV);
exit(0);

# master -- master control
sub master
{
    my(@argv) = @_;
    my($tail,@tails);
    local(%tails);

    # scan argv for -opt
    while (1) {
        my($opt) = $argv[0];
        last unless (defined($opt));

        last unless ($opt =~ s/^-/opt_/);
        $$opt = 1;

        shift(@argv);
    }

    # load up the current directory
    @tails = dirload(".");

    foreach $tail (@tails) {
        $tails{$tail} = 0;
    }

    # look for the base name of a file
    foreach $tail (@tails) {
        docat($tail)
            if ($tail =~ /R[12]_001\.fastq\.gz$/);
    }

    # default mode is "dry run"
    unless ($opt_go) {
        printf("\n");
        printf("rerun with -go to actually do it\n");
    }
}

# docat -- process a pairing
sub docat
{
    my($base) = @_;
    my($tail);
    my($out);
    my($cmd);
    my($code);

    # all commands are joining just two files

    $tail = $base;
    $tail =~ s/\.fastq/-2.fastq/;

    # to an output file
    $out = $base;
    $out =~ s/\.fastq/-cat.fastq/;

    $cmd = "cat $base $tail > $out";

    if ($opt_v) {
        printf("\n");
        printf("IN1: %s\n",$base);
        printf("IN2: %s\n",$tail);
        printf("OUT: %s\n",$out);
    }
    else {
        printf("%s\n",$cmd);
    }

    die("docat: duplicate IN1\n")
        if ($tails{$base});
    $tails{$base} = 1;

    die("docat: duplicate IN2\n")
        if ($tails{$tail});
    $tails{$tail} = 1;

    die("docat: duplicate OUT\n")
        if ($tails{$out});
    $tails{$out} = 1;

    {
        last unless ($opt_go);

        # execute the command and get error code
        system($cmd);
        $code = $? >> 8;

        exit($code) if ($code);
    }
}

# dirload -- get list of files in a directory
sub dirload
{
    my($dir) = @_;
    my($xf);
    my($tail);
    my(@tails);

    # open the directory
    opendir($xf,$dir) or
        die("dirload: unable to open '$dir' -- $!\n");

    # get list of files in the directory excluding "." and ".."
    while ($tail = readdir($xf)) {
        next if ($tail eq ".");
        next if ($tail eq "..");
        push(@tails,$tail);
    }

    closedir($xf);

    @tails = sort(@tails);

    @tails;
}

Here's the program output with the -v option:

IN1: N-4-Bmp-1_S20_L001_R1_001.fastq.gz
IN2: N-4-Bmp-1_S20_L001_R1_001-2.fastq.gz
OUT: N-4-Bmp-1_S20_L001_R1_001-cat.fastq.gz

IN1: N-4-Bmp-1_S20_L001_R2_001.fastq.gz
IN2: N-4-Bmp-1_S20_L001_R2_001-2.fastq.gz
OUT: N-4-Bmp-1_S20_L001_R2_001-cat.fastq.gz

IN1: N-4A-2A_S135_L001_R1_001.fastq.gz
IN2: N-4A-2A_S135_L001_R1_001-2.fastq.gz
OUT: N-4A-2A_S135_L001_R1_001-cat.fastq.gz

IN1: N-4A-2A_S135_L001_R2_001.fastq.gz
IN2: N-4A-2A_S135_L001_R2_001-2.fastq.gz
OUT: N-4A-2A_S135_L001_R2_001-cat.fastq.gz

rerun with -go to actually do it
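
After rerunning with -go, the concern raised in the comments about concatenating compressed files can be checked directly. The gzip format allows several compressed members in one file, and `gzip -d` decompresses all of them, so a plain `cat` of two `.gz` files should give a valid result. The sketch below (with `verify_merge` as a hypothetical helper name) tests that the output is valid gzip and that its line count equals the sum of the two inputs. Whether every downstream tool reads all members of such a file is an assumption worth testing separately.

```shell
#!/bin/bash
# verify_merge IN1 IN2 OUT
# Hypothetical helper: succeeds only if OUT passes gzip's integrity test
# and decompresses to as many lines as IN1 and IN2 together.
verify_merge() {
  local in1=$1 in2=$2 out=$3 expected actual
  gzip -t "$out" || return 1
  expected=$(( $(gzip -dc "$in1" | wc -l) + $(gzip -dc "$in2" | wc -l) ))
  actual=$(( $(gzip -dc "$out" | wc -l) ))
  [ "$actual" -eq "$expected" ]
}
```

For example: `verify_merge N-4A-2A_S135_L001_R1_001.fastq.gz N-4A-2A_S135_L001_R1_001-2.fastq.gz N-4A-2A_S135_L001_R1_001-cat.fastq.gz && echo ok`.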
