How to only concatenate files with same identifier using bash script?

Question

I have a directory with files, some have the same ID, which is given in the first part of the file name before the first underscore (always). e.g.:

S100_R1.txt
S100_R2.txt
S111_1_R1.txt
S111_R1.txt
S111_R2.txt
S333_R1.txt

I want to concatenate the files that share an ID (and, if possible, place the original files in another dir), e.g. output:

original files (folder)
S100_merged.txt
S111_merged.txt
S333_R1.txt

Small note: I imagine that perhaps a solution would be to place all files which will be processed by the code in a new directory and then, in a second step, move the files with the appended "merged" back to the original dir or something like this...

I am extremely new to bash scripting, so I really can't produce this code. I am used to the R language and I can think how it should be but can't write it.

My pitiful attempt is something like this:

while IFS= read -r -d '' id; do
    cat *"$id" > "./${id%.txt}_grouped.txt"
done < <(printf '%s\0' *.txt | cut -zd_ -f1- | sort -uz)

or this:

for ((k=100;k<400;k=k+1)); 
do
IDList= echo "S${k}_S*.txt" | awk -F'[_.]' '{$1}'
while [ IDList${k} == IDList${k+n} ]; do
        cat IDList${k}_S*.txt IDList${k+n}_S*.txt S${k}_S*.txt S${k}_S*.txt >cat/S${k}_merged.txt &; 
  done

Sometimes there is only one version of the file (e.g. S333_R1.txt), sometimes two (S100*), three (S111*) or more of the same.

I am prepared for harsh critique for this question because I am so far from a solution, but if someone would be willing to help me out I would greatly appreciate it!

Can the output `S333_R1.txt` file be named `S333_merged.txt`? — KamilCuk, Feb 09 '21 at 11:01
yeah that shouldn't be a big problem... but preferable not. But I'll take it, if that's part of a solution you have :) — Mathilde, Feb 09 '21 at 11:03
`printf '%s\0' *.txt` Do your files really have newlines in their filenames? If so, you'd be better off with Python. – KamilCuk, Feb 09 '21 at 11:12
Honestly (and I am sorry) I don't fully understand the command - I have copy-pasted it from other online code examples and tried to "modulate" it into my situation. "New lines in file name": No, that is not something I have. What would that even look like..... — Mathilde, Feb 09 '21 at 11:16
`touch something$'\n'something_after_newline ; ls` See for yourself :p — KamilCuk, Feb 09 '21 at 11:18
@Mathilde : For a given ID, you can group the files with this id by `cat "$id"_*.txt >"$id"_merged.txt`. – user1934428, Feb 09 '21 at 11:54
@Mathilde : I would separate the problem into two different subproblems: (1) Find all those IDs which are required for the combining process (`S100` and `S111` in your example). (2) Group those files for a given ID. Write separate shell scripts for each subproblem (so that you can debug them separately). Finally write a master script which uses the scripts (1) and (2) to achieve the overall desired effect. – user1934428, Feb 09 '21 at 11:57
@user1934428 Thank you for your comments, regarding your first suggestion, as I understand I would need to know the ID and write it e.g. cat "$S100"_*.txt > "$S100"_merged.txt or? That would not be so feasible for a case with many IDs. — Mathilde, Feb 09 '21 at 12:06
@Mathilde : That's why you need the script I referred to as '(1)'. A script which calculates all possible IDs and from them picks only those for which there are at least 2 files starting with the same ID. This is also the way we would do it manually. Hence, you need a script `get_ids` which produces on stdout a list of the ids to save, and then another one called merge_ids, and finally you combine those. – user1934428, Feb 09 '21 at 12:09
I see. My skills are so damn limited here, it is very frustrating, but hopefully I will get better fast! Would love to be good (or just moderate okay) at this stuff. As always, I am GRATEFUL for the feedback on this forum! — Mathilde, Feb 09 '21 at 12:17
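
The two-step approach suggested in the comments above could be sketched roughly as follows. This is only an illustration: the function names `get_ids`, `merge_id` and `merge_all` are hypothetical, and the sketch assumes that file names contain no newlines and that no `*_merged.txt` files exist yet in the directory.

```bash
#!/bin/bash
# Hypothetical helpers following the two-subproblem idea from the comments.

# Step 1: print every ID that occurs in at least two *.txt files.
get_ids() {
    local f
    for f in *_*.txt; do
        [ -e "$f" ] || continue       # the glob matched nothing
        printf '%s\n' "${f%%_*}"      # part before the first underscore
    done | sort | uniq -d             # uniq -d keeps only repeated IDs
}

# Step 2: concatenate all files of one ID into <ID>_merged.txt.
merge_id() {
    cat "${1}"_*.txt > "${1}_merged.txt"
}

# Master step: combine (1) and (2) for the current directory.
merge_all() {
    get_ids | while IFS= read -r id; do
        merge_id "$id"
    done
}
```

Calling `merge_all` inside the directory with the files would then create one merged file per repeated ID and leave single files such as `S333_R1.txt` alone.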

score 2 · Answer 1 · answered Feb 09 '21 at 11:31

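# read each distinct prefix into $line (sort -u removes duplicate prefixes)
while read -r line
do
  # only merge when more than one file shares the prefix
  if [[ "$(find . -maxdepth 1 -name "${line}_*.txt" | wc -l)" -gt 1 ]]
  then
      # braces are required: $line_ would be read as a variable named "line_"
      cat "${line}"_*.txt > "${line}_merged.txt"
  fi
done <<< "$(for i in *_*.txt; do echo "$i"; done | awk -F_ '{ print $1 }' | sort -u)"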

List the files matching *_*.txt and pipe the names into awk, which prints the part before the first "_". Feed these prefixes into a while loop. For each prefix, use find to count the matching files; if there is more than one, cat the files with that prefix into a merged file.

Thank you for your effort! - See my comment to KamilCuks post, unfortunately I cannot get your script to work, but I think it has to do with Windows vs. Linux. — Mathilde, Feb 09 '21 at 11:57

KamilCuk · Answer 2 · 2021-02-09T11:27:54.430

A plain bash loop with preprocessing:

# first get the list of files
find . -type f |
# then extract the prefix
sed 's@./\([^_]*\)_@\1\t&@' |
# then in a loop merge the files
while IFS=$'\t' read prefix file; do
     cat "$file" >> "${prefix}_merged.txt"
done

That script is iterative: it handles one file at a time, so every prefix gets a _merged.txt file, even a prefix with a single file. To detect whether a prefix has only one file, we have to look at all files at once. So first, an awk script joins the filenames that share a common prefix:

find . -type f |    # maybe `sort |` ?
# join filenames with common prefix
awk '{ 
      f=$0;                            # remember the file path
      gsub(/.*\//,"");gsub(/_.*/,"");  # extract prefix from filepath and store it in $0
      a[$0]=a[$0]" "f                  # Join path with leading space in associative array indexed with prefix
    }
    # Output prefix and filenames separated by spaces.
    # TBH a tab would be a better separator..
    END{for (i in a) print i a[i]}
' |
# Read input separated by spaces into a bash array
# (assumes "$outdir" was set beforehand to an existing output directory)
while IFS=' ' read -ra files; do
    # first array element is the prefix
    prefix=${files[0]}
    unset 'files[0]'
    # rest is the files
    case "${#files[@]}" in
    0) echo super error; ;;
    # one file - preserve the filename
    1) cat "${files[@]}" > "$outdir"/"${files[1]}"; ;;
    # more files - do a _merged.txt suffix
    *) cat "${files[@]}" > "$outdir"/"${prefix}_merged.txt"; ;;
    esac
done

Tested on repl.

IDList= echo "S${k}_S*.txt"

Executes the command echo with the environment variable IDList exported and set to empty with one argument equal to S<insert value of k here>_S*.txt.

Filename expansion (ie. * -> list of files) is not executed inside " double quotes.

To assign the output of a command to a variable, use command substitution: var=$( something something | something )
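
A small sketch of the two points above, run in a throwaway directory. The file names are made up for illustration and `id_list` is just an example variable name.

```bash
#!/bin/bash
# Demo in a temporary directory so no real files are touched.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch S100_R1.txt S100_R2.txt

quoted=$(echo "S100_*.txt")    # quotes suppress globbing: the pattern stays literal
unquoted=$(echo S100_*.txt)    # unquoted: the shell expands the pattern to file names

echo "$quoted"      # S100_*.txt
echo "$unquoted"    # S100_R1.txt S100_R2.txt

# Command substitution captures the output of a whole pipeline in a variable.
id_list=$(printf '%s\n' S100_*.txt | awk -F_ '{print $1}' | sort -u)
echo "$id_list"     # S100
```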

IDList${k+n}_S*.txt

${var+pattern} is a parameter expansion; it does not add two variables together. It expands to pattern when var is set and to nothing when var is unset. See shell parameter expansion and my answer on ${var-pattern}, which is similar.

To add two numbers, use arithmetic expansion: $((k + n)).
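
To make the difference concrete, here is a minimal sketch. The variable names `alt`, `unset_alt`, `sum` and `nosuchvar` are only for illustration.

```bash
#!/bin/bash
k=100
n=1

alt=${k+n}                 # k is set, so this yields the literal word "n"
unset_alt=${nosuchvar+n}   # nosuchvar is unset, so this yields an empty string
sum=$((k + n))             # arithmetic expansion actually adds the numbers

echo "alt=$alt unset_alt=$unset_alt sum=$sum"   # alt=n unset_alt= sum=101
```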

awk -F'[_.]' '{$1}'

A bare $1 prints nothing here. To print the first field, print it explicitly: {print $1}.
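
For example, assuming a file name taken from the question, the ID can be extracted like this; the second variant does the same with a bash parameter expansion instead of awk.

```bash
#!/bin/bash
# Split on "_" or "." and print the first field, i.e. the ID.
id=$(echo "S111_1_R1.txt" | awk -F'[_.]' '{print $1}')
echo "$id"            # S111

# Pure bash alternative: strip everything from the first "_" onwards.
name=S111_1_R1.txt
bash_id=${name%%_*}
echo "$bash_id"       # S111
```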

Remember to check your scripts with http://shellcheck.net

1st: THANK YOU! looking at this I think, I would have never managed. 2nd: Copying this into a file and running it I get some errors in Line 11 ( $'\r': command not found ), and two errors in line 18 (syntax error near unexpected token `$'in\r'' and ` case "${#files[@]}" in). I am running Windows using Cygwin terminal (but it is only for practice, the command will in the end be running on a Linux PC). — Mathilde, Feb 09 '21 at 11:29
Your file has dos line endings - you saved it wrongly in your editor. Remove dos line endings. Research `dos2unix` and such. — KamilCuk, Feb 09 '21 at 11:32
I really want to thank you KamilCuk for your effort and explanations! Unfortunately I can't get it to work. When I convert it to Unix nothing happen when I run it. The same is the case with @Raman Sailopals solution. But Abelisto's code I can run and it works — Mathilde, Feb 09 '21 at 11:54
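
As a rough illustration of removing DOS line endings, as advised in the comments above: the sketch below uses `tr`, which only deletes the carriage returns; `dos2unix`, if installed, converts a file in place. The file names are placeholders.

```bash
#!/bin/bash
# Create a small script with DOS (CRLF) line endings in a temporary directory.
tmp=$(mktemp -d)
printf 'echo hello\r\necho world\r\n' > "$tmp/dos_script.sh"

# Delete every carriage return to get Unix (LF) line endings.
tr -d '\r' < "$tmp/dos_script.sh" > "$tmp/unix_script.sh"

# Count the lines that still contain a carriage return.
remaining=$(grep -c $'\r' "$tmp/unix_script.sh" || true)
echo "lines with CR left: $remaining"    # lines with CR left: 0
```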

Abelisto · Accepted Answer · 2021-02-09T13:04:59.840

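# "folder" must exist before running this script (e.g. mkdir -p folder)
for id in $(ls | grep -Po '^[^_]+' | uniq) ; do
    # only merge when more than one .txt file shares this id
    if [ "$(ls "${id}"_*.txt 2> /dev/null | wc -l)" -gt 1 ] ; then
        # the temporary leading "_" keeps the merged file out of the ${id}_*.txt glob
        cat "${id}"_*.txt > "_${id}_merged.txt"
        mv "${id}"_*.txt folder
    fi
done

# strip the temporary leading "_" from the merged files
for f in _*_merged.txt ; do
    [ -e "$f" ] || continue    # nothing was merged
    mv "$f" "${f:1}"
done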

edited Feb 09 '21 at 13:04

answered Feb 09 '21 at 11:35

@Abelisto : This would produce for the example input provided by the OP also a file `S333_merged.txt`, which is not desired. – user1934428 Feb 09 '21 at 11:59
@Mathilde Fixed wrongly moving `*_merged.txt` files to the `folder` – Abelisto Feb 09 '21 at 12:07
I do get a S333_merged in both the long and short version of this script. in addition I get something called 'folder' but it isn't a folder (perhaps a Windows problem) and no files have moved folders. Hope that was okay clear. – Mathilde Feb 09 '21 at 12:15
`folder` here is a `dir` from your question: "placing the original files in another `dir`" And it should be created before running this script. – Abelisto Feb 09 '21 at 12:21
ah I see! now the files are moved there. one issue is though that I still get the S333_merged file and also a txt file called "folder_merged", Actually I get for all files in the folder also bash files, which is not optimal. Can it be written to only consider txt files? – Mathilde Feb 09 '21 at 12:53
@Mathilde You could use `ls ${id}_*.txt 2> /dev/null | wc -l` to count the files and do nothing if there is only one file with the specified `id`. Answer updated. – Abelisto Feb 09 '21 at 13:06

M. Nejat Aydin · Answer 4 · 2021-02-10T12:29:53.550

A pure bash way below. It uses only globs (no need for external commands like ls or find for this question) to enumerate filenames and an associative array (which is supported by bash since the version 4.0) in order to compute frequencies of ids. Parsing ls output to list files is questionable in bash. You may consider reading ParsingLs.

#!/bin/bash

backupdir=original_files # The directory to move the original files
declare -A count # Associative array to hold id counts

# If it is assumed that the backup directory exists prior to call, then
# drop the line below
mkdir "$backupdir" || exit

for file in [^_]*_*; do ((++count[${file%%_*}])); done
for id in "${!count[@]}"; do
    if ((count[$id] > 1)); then
        mv "$id"_* "$backupdir"
        cat "$backupdir/$id"_* > "$id"_merged.txt
    fi
done
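
To see what the counting step produces, here is a small sketch that applies the same idea to the sample file names from the question. It works on an in-memory list (the array `files` is an assumption for the demo), so no files are moved or created, and it needs bash 4.0 or later for the associative array.

```bash
#!/bin/bash
# Sample names from the question; nothing on disk is touched.
files=(S100_R1.txt S100_R2.txt S111_1_R1.txt S111_R1.txt S111_R2.txt S333_R1.txt)

declare -A count    # id -> number of files with that id
for file in "${files[@]}"; do
    id=${file%%_*}                          # part before the first underscore
    count[$id]=$(( ${count[$id]:-0} + 1 ))
done

for id in S100 S111 S333; do
    echo "$id ${count[$id]}"
done
# S100 2
# S111 3
# S333 1
```

Only `S100` and `S111` have a count above 1, so only those would be merged by the script above.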

edited Feb 10 '21 at 12:29

answered Feb 09 '21 at 15:37

Thank you very much - unfortunately I cannot get it to work, but I am facing some stupid errors which I believe have to do with running on Windows. I will test all the suggested codes on Linux ASAP. thank you all! – Mathilde Feb 10 '21 at 10:33
@Mathilde Do you run the `bash` on Windows? – M. Nejat Aydin Feb 10 '21 at 12:33
yes-ish. My PC is a Windows machine.The script I am trying to write will however in the end run in a Linux environment. But because I need to LEARN so much, it is more feasible for me to sit at my own desk and practice so to speak... I thought running it in a Cygwin terminal would make it work, but yeah no, I am running into trouble. Truth be told seeing all these great answers, I realized that my skills limitations in this are bigger than I thought. So I have actually involved a colleague who is very good at this stuff and has agreed to help me. I hope it doesn't feel like I wasted your time. – Mathilde Feb 11 '21 at 07:50
