How to concatenate lines based on one line in common?

Question

I have a tab separated file that looks like this:

4S2P_1:A    4S2P_1:A
4S2P_1:A    6PXX_1:A
4S2P_1:A    6HB8_1:A
4S2P_1:A    6HOO_1:A
4S2P_1:A    6I5D_1:A
4S2R_1:A    4S2R_1:A
4S2C_1:A    4S2C_1:A
4S2C_1:A    4S2B_1:A
4S2E_1:A    4S2E_1:A
4S2E_1:A    5XB5_1:A
4S2E_1:A    5XBH_1:A

The file is built so that the second column lists the sequences similar to the one in the first column. For example, 4S2P_1:A is similar to itself, 6Q5B_1:A, 6PXX_1:A, 6HB8_1:A and so on, while 4S2R_1:A is similar only to itself.

I want to parse the file to look like this:

4S2P_1:A 6PXX_1:A 6HB8_1:A 6HOO_1:A 6I5D_1:A
4S2E_1:A 5XB5_1:A 5XBH_1:A
4S2C_1:A 4S2B_1:A
4S2R_1:A

So I want each output line to contain a first-column entry followed by the entries linked to it, separated by spaces, with the resulting clusters listed in decreasing order of size.

I would like to use awk to do this.

I tried using this:

awk -F '\t' '{print $1*" "$2}'

But it gives me this output:

04S2P_1:A
05DTT_1:A
07ASS_1:A
07AUX_1:A
05HAQ_1:A
05HAP_1:A
05HAR_1:A

It adds a 0 at the beginning and doesn't keep the similar sequences on the same line.

Welcome to Stack Overflow (SO). [SO is a question and answer page for professional and enthusiast programmers](https://stackoverflow.com/tour). Please add your own code to your question. You are expected to show at least the amount of research you have put into solving this question yourself. — Cyrus, Nov 09 '21 at 10:09
Why would you like to use `awk` for this? Technically you can — append `" " $2` to `some_array[$1]` as you read the file. But the very same thing can be achieved using associative arrays directly in Bash, i.e. `declare -A some_array` etc. — Andrej Podzimek, Nov 09 '21 at 10:35
If you want to solve it with `awk` instead of `bash` you should tag your question with "awk". — ceving, Nov 09 '21 at 12:11
is the file already sorted by the 1st column? other than sorting the output by the number of fields, are there any other sorting requirements for the output ... either between fields on the same line, or between lines with the same number of fields? how big is your largest input file (MBytes? number of lines?) — markp-fuso, Nov 09 '21 at 12:40

score 1 · Answer 1 · answered Nov 09 '21 at 11:01

Typically a hash is used to make a list unique.

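#! /bin/bash

declare -A hash

# Read the two columns from standard input.
while read -r c1 c2; do
  # Register the key, even if it only matches itself.
  hash[$c1]+=""
  # Skip the self-match so the key is not printed twice.
  [[ $c1 == "$c2" ]] || hash[$c1]+=" $c2"
done

for key in "${!hash[@]}"; do
  printf '%s%s\n' "$key" "${hash[$key]}"
done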

The disadvantage is that you lose the original sort order. But it seems to me that you do not care about the original order. If you want to sort the output by the length of each line, you can take one of the answers to that question.
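One way to do that sorting, sketched under the assumption that "length" means the number of entries on a line: prefix each line with its field count, sort numerically in descending order, then strip the prefix again. The function name `sort_by_size` is made up for this illustration.

```shell
# Sort lines read from standard input by their number of
# whitespace-separated fields, largest first.
sort_by_size() {
  awk '{ print NF "\t" $0 }' |   # prefix each line with its field count and a tab
    sort -k1,1nr |               # numeric, descending, on that count only
    cut -f2-                     # drop the count again
}
```

Assuming the loop above is saved as a script (say `cluster.sh`, a hypothetical name), it could be used as `./cluster.sh < file | sort_by_size`. Lines with the same number of fields come out in whatever order `sort` gives them.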

score 0 · Accepted Answer · answered Nov 09 '21 at 11:47

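Here is a simple Awk script to lift values with the same key to the same line. It assumes that rows with the same key are adjacent and that each group starts with the key matched against itself, as in your sample; only the second column is printed.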

awk '$1 != prev { if(prev) printf "\n";
     prev=$1; printf "%s", $2; next }
   { printf " %s", $2 }
  END { if (prev) printf "\n" }' file

To sort by the length of each record, you will need to keep things in memory while reading. The script above is attractive for its simplicity and robustness (it should work for files of any size), but we can make it slightly more involved so that it prints a sort key in front of each line, at the cost of keeping each complete record in memory until we know its length.

awk 'function pr () { printf "%i\t", n; printf "%s", a[1];
    for(i=2; i<=n; ++i) printf " %s", a[i];
    printf "\n"; delete a; n=0 }
  $1 != prev { if (prev) pr(); prev=$1; a[1]=$2; n=1; next }
  { a[++n] = $2 }
  END { if (n) pr() }' file |
sort -t $'\t' -k1rn |
cut -f2-

Or we can just keep everything in memory, remove the assumption that keys in the first column are grouped and keep the script trivial: `awk '{keys[$1] = keys[$1] " " $2} END {for (key in keys) print key, substr(keys[key], 2)}'` — Andrej Podzimek, Nov 09 '21 at 13:54
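A fuller sketch of that in-memory approach, which also skips the row that matches a key against itself and sorts the clusters by size, might look like the following. The function name `cluster_lines` and the `members` and `count` arrays are names chosen for this example, and it assumes the input is tab-separated as described in the question.

```shell
# Group the second column by the first column, then print the
# largest clusters first. Reads the named files or standard input.
cluster_lines() {
  awk -F '\t' '
    # Remember every key, even one that only matches itself.
    !($1 in members) { members[$1] = ""; count[$1] = 1 }
    # Append the second column unless it merely repeats the key.
    $2 != $1 { members[$1] = members[$1] " " $2; count[$1]++ }
    END {
      # Emit "size<TAB>key member member ..." for the sort step.
      for (key in members) printf "%d\t%s%s\n", count[key], key, members[key]
    }
  ' "$@" |
    sort -t "$(printf '\t')" -k1,1nr -k2,2 |   # by size descending, then by key
    cut -f2-                                   # drop the size column
}
```

Called as `cluster_lines file`, this keeps the whole input in memory, so it trades the streaming behaviour of the answer above for not needing the keys to be grouped.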
