-1

I need to merge two sets of files together. I don't mind what language is used as long as it works. For example, I have a folder containing both sets of files:

N-4A-2A_S135_L001_R1_001.fastq.gz
N-4A-2A_S135_L001_R2_001.fastq.gz
N-4A-2A_S135_L001_R1_001-2.fastq.gz
N-4A-2A_S135_L001_R2_001-2.fastq.gz

N-4-Bmp-1_S20_L001_R1_001.fastq.gz
N-4-Bmp-1_S20_L001_R2_001.fastq.gz
N-4-Bmp-1_S20_L001_R1_001-2.fastq.gz
N-4-Bmp-1_S20_L001_R2_001-2.fastq.gz

I need them merged like this:

N-4A-2A_S135_L001_R1_001.fastq.gz + N-4A-2A_S135_L001_R1_001-2.fastq.gz > N-4A-2A_S135_L001_R1_001-cat.fastq.gz
N-4A-2A_S135_L001_R2_001.fastq.gz + N-4A-2A_S135_L001_R2_001-2.fastq.gz > N-4A-2A_S135_L001_R2_001-cat.fastq.gz
N-4-Bmp-1_S20_L001_R1_001.fastq.gz + N-4-Bmp-1_S20_L001_R1_001-2.fastq.gz > N-4-Bmp-1_S20_L001_R1_001-cat.fastq.gz
N-4-Bmp-1_S20_L001_R2_001.fastq.gz + N-4-Bmp-1_S20_L001_R2_001-2.fastq.gz > N-4-Bmp-1_S20_L001_R2_001-cat.fastq.gz

I have a total of ~180 files so any way to automate this would be amazing. They all follow the same format of:

(sample name) + "R1_001" or "R2_001" + "" or "-2" + .fastq.gz
ertictus
  • And why on earth do you want to do this in C? – klutt Apr 12 '21 at 13:14
  • I'd do it any way as long as they are concatenated the same – ertictus Apr 12 '21 at 13:16
  • The easiest way to do this is with a shell script. Please re-tag accordingly. – Eugene Sh. Apr 12 '21 at 13:19
  • Are you sure it will work to just concatenate them? It's binary files with headers. – klutt Apr 12 '21 at 13:21
  • Yeah, there are no headers in the file. I just need the equivalent of cut + paste the contents of one at the end of the other in a new file. – ertictus Apr 12 '21 at 13:23
  • https://stackoverflow.com/questions/10347278/how-can-i-copy-several-binary-files-into-one-file-on-a-linux-system – klutt Apr 12 '21 at 13:24
  • The cat command works on them individually for what I need; I just need help automating it because I have so many files. Keeping the original sample names for the new files is what I'm stuck on. – ertictus Apr 12 '21 at 13:29
  • I've previously been taught to use `for i in {1..180}`, but because the names aren't in sequence I don't know how to approach it. – ertictus Apr 12 '21 at 13:31
  • Use a wildcard that just matches one set of files. Loop over the results of that, convert the filename to the corresponding second file, then concatenate them. – Barmar Apr 12 '21 at 15:27
  • E.g. `for file in N-4A-2A_S135_L001_*.fastq.gz` – Barmar Apr 12 '21 at 15:29
  • What is the "joining" criteria? You have `N-4A-2A_S135_*` and `N-4-Bmp-1_S20_*`. So, should they be joined on a match of the `S135` vs `S20` subparts? Does the `L001` ever change? Is this a one time merge? Or, will you need to use the program repeatedly over the course of time? – Craig Estey Apr 12 '21 at 15:37
  • Does this answer your question? [How can I copy several binary files into one file on a Linux system?](https://stackoverflow.com/questions/10347278/how-can-i-copy-several-binary-files-into-one-file-on-a-linux-system) – bad_coder Apr 12 '21 at 15:41
  • AFAIK (AFAICT), you will need to decompress each member of the pair of files, concatenate the decompressed files, and then (re)compress the result. Use a shell script; it's going to be easier. – Jonathan Leffler Apr 12 '21 at 16:49
  • @ertictus: If **any** language is fine for you, choose whichever you are most familiar with, and post your own attempts to solve the problem. You just describe what you want to achieve, but I don't see at which point you are stuck. – user1934428 Apr 13 '21 at 06:50
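
On the question raised in the comments of whether gzip files can simply be concatenated: the gzip format (RFC 1952) allows a file to consist of several compressed members, and gzip decompresses them in sequence. The result of `cat a.gz b.gz > c.gz` is therefore a valid gzip file containing the data of the first file followed by the data of the second. The following is a minimal sketch that checks this in a scratch directory; the file names and the `read-1`/`read-2` contents are made up for the demonstration.

```shell
#!/bin/bash
# Demo in a scratch directory: two gzip files joined with cat still decompress cleanly.
# File names and contents here are invented for illustration.
set -euo pipefail
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

printf 'read-1\n' | gzip > "$tmp/a.fastq.gz"
printf 'read-2\n' | gzip > "$tmp/b.fastq.gz"

# Plain byte-level concatenation; nothing is decompressed or recompressed.
cat "$tmp/a.fastq.gz" "$tmp/b.fastq.gz" > "$tmp/ab.fastq.gz"

# gzip -t verifies integrity; gzip -dc writes the decompressed data to stdout.
gzip -t "$tmp/ab.fastq.gz"
joined=$(gzip -dc "$tmp/ab.fastq.gz")
printf '%s\n' "$joined"
```

If `gzip -t` exits without an error and both lines are printed, plain `cat` is enough and no decompress/recompress step is needed.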

2 Answers

1

According to the example you provided, your file names look sortable. If that is not the case, this will not work and this answer is invalid.

The idea of the following snippet is simple: sort all the file names, then join adjacent pairs. It assumes every file has exactly one partner, so that the sorted list splits cleanly into pairs.

This should work in bash:

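#!/bin/bash
# Collect only the input files (skip this script and any earlier "-cat" output).
# mapfile needs bash 4 or newer.
mapfile -t files < <(find . -maxdepth 1 -type f -name '*.fastq.gz' ! -name '*-cat.fastq.gz' | LC_ALL=C sort -r)
files_num=${#files[@]}
# In reverse C-locale order each original file is directly followed by its "-2" partner.
for (( i=0 ; i < files_num ; i=i+2 )); do
  f1="${files[i]}"
  f2="${files[i+1]}"
  # Build the requested name: <sample>-cat.fastq.gz
  fname="${f1%.fastq.gz}-cat.fastq.gz"
  echo "$f1 and $f2 -> $fname"
  cat "$f1" "$f2" > "$fname"
done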

Or a simpler version in Z shell (zsh):

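#!/bin/zsh
# Split find's output on newlines only; skip any earlier "-cat" output.
files=( ${(f)"$(find . -maxdepth 1 -type f -name '*.fastq.gz' ! -name '*-cat.fastq.gz' | LC_ALL=C sort -r)"} )
# zsh can take two list items per iteration.
for f1 f2 in $files; do
  fname="${f1%.fastq.gz}-cat.fastq.gz"
  echo "$f1 and $f2 -> $fname"
  cat "$f1" "$f2" > "$fname"
done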
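
If the names cannot be relied on to sort into adjacent pairs, the wildcard approach suggested in the comments avoids depending on order: match only the first file of each pair, derive the partner and output names from it, and skip anything without a partner. The following is a rough sketch of that idea, assuming the naming format given in the question; `merge_pairs` is a hypothetical helper name, not an existing tool.

```shell
#!/bin/bash
# merge_pairs DIR: for every "<sample>_R1_001.fastq.gz" or "<sample>_R2_001.fastq.gz"
# in DIR, append the matching "...-2.fastq.gz" and write "...-cat.fastq.gz".
# (merge_pairs is a hypothetical helper name used for this sketch.)
merge_pairs() {
  local dir=${1:-.} first second out
  for first in "$dir"/*_R[12]_001.fastq.gz; do
    [ -e "$first" ] || continue               # the glob matched nothing
    second="${first%.fastq.gz}-2.fastq.gz"    # partner file
    out="${first%.fastq.gz}-cat.fastq.gz"     # requested output name
    if [ ! -f "$second" ]; then
      echo "skipping $first: no partner $second" >&2
      continue
    fi
    echo "$first + $second -> $out"
    cat "$first" "$second" > "$out"
  done
}

# Process the directory given as the first argument (default: current directory).
merge_pairs "${1:-.}"
```

Because the glob ends in `_001.fastq.gz`, it matches neither the `-2` files nor earlier `-cat` output, so rerunning the script simply rewrites the same output files.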
rcwnd_cz
0

You did say any language, so ... I prefer Perl:

#!/usr/bin/perl
# concatenate files

master(@ARGV);
exit(0);

# master -- master control
sub master
{
    my(@argv) = @_;
    my($tail,@tails);
    local(%tails);

    # scan argv for -opt
    while (1) {
        my($opt) = $argv[0];
        last unless (defined($opt));

        last unless ($opt =~ s/^-/opt_/);
        $$opt = 1;

        shift(@argv);
    }

    # load up the current directory
    @tails = dirload(".");

    foreach $tail (@tails) {
        $tails{$tail} = 0;
    }

    # look for the base name of a file
    foreach $tail (@tails) {
        docat($tail)
            if ($tail =~ /R[12]_001\.fastq\.gz$/);
    }

    # default mode is "dry run"
    unless ($opt_go) {
        printf("\n");
        printf("rerun with -go to actually do it\n");
    }
}

# docat -- process a pairing
sub docat
{
    my($base) = @_;
    my($tail);
    my($out);
    my($cmd);
    my($code);

    # all commands are joining just two files

    $tail = $base;
    $tail =~ s/\.fastq/-2.fastq/;

    # to an output file
    $out = $base;
    $out =~ s/\.fastq/-cat.fastq/;

    $cmd = "cat $base $tail > $out";

    if ($opt_v) {
        printf("\n");
        printf("IN1: %s\n",$base);
        printf("IN2: %s\n",$tail);
        printf("OUT: %s\n",$out);
    }
    else {
        printf("%s\n",$cmd);
    }

    die("docat: duplicate IN1\n")
        if ($tails{$base});
    $tails{$base} = 1;

    die("docat: duplicate IN2\n")
        if ($tails{$tail});
    $tails{$tail} = 1;

    die("docat: duplicate OUT\n")
        if ($tails{$out});
    $tails{$out} = 1;

    {
        last unless ($opt_go);

        # execute the command and get error code
        system($cmd);
        $code = $? >> 8;

        exit($code) if ($code);
    }
}

# dirload -- get list of files in a directory
sub dirload
{
    my($dir) = @_;
    my($xf);
    my($tail);
    my(@tails);

    # open the directory
    opendir($xf,$dir) or
        die("dirload: unable to open '$dir' -- $!\n");

    # get list of files in the directory excluding "." and ".."
    while ($tail = readdir($xf)) {
        next if ($tail eq ".");
        next if ($tail eq "..");
        push(@tails,$tail);
    }

    closedir($xf);

    @tails = sort(@tails);

    @tails;
}

Here's the program output with the -v option:

IN1: N-4-Bmp-1_S20_L001_R1_001.fastq.gz
IN2: N-4-Bmp-1_S20_L001_R1_001-2.fastq.gz
OUT: N-4-Bmp-1_S20_L001_R1_001-cat.fastq.gz

IN1: N-4-Bmp-1_S20_L001_R2_001.fastq.gz
IN2: N-4-Bmp-1_S20_L001_R2_001-2.fastq.gz
OUT: N-4-Bmp-1_S20_L001_R2_001-cat.fastq.gz

IN1: N-4A-2A_S135_L001_R1_001.fastq.gz
IN2: N-4A-2A_S135_L001_R1_001-2.fastq.gz
OUT: N-4A-2A_S135_L001_R1_001-cat.fastq.gz

IN1: N-4A-2A_S135_L001_R2_001.fastq.gz
IN2: N-4A-2A_S135_L001_R2_001-2.fastq.gz
OUT: N-4A-2A_S135_L001_R2_001-cat.fastq.gz

rerun with -go to actually do it
Craig Estey