I have a tab-separated file that looks like this:
4S2P_1:A 4S2P_1:A
4S2P_1:A 6PXX_1:A
4S2P_1:A 6HB8_1:A
4S2P_1:A 6HOO_1:A
4S2P_1:A 6I5D_1:A
4S2R_1:A 4S2R_1:A
4S2C_1:A 4S2C_1:A
4S2C_1:A 4S2B_1:A
4S2E_1:A 4S2E_1:A
4S2E_1:A 5XB5_1:A
4S2E_1:A 5XBH_1:A
The file is built so that the second column lists the sequences similar to the one in the first column. For example, 4S2P_1:A is similar to itself, 6PXX_1:A, 6HB8_1:A and so on, while 4S2R_1:A is similar only to itself.
I want to parse the file to look like this:
4S2P_1:A 6PXX_1:A 6HB8_1:A 6HOO_1:A 6I5D_1:A
4S2E_1:A 5XB5_1:A 5XBH_1:A
4S2C_1:A 4S2B_1:A
4S2R_1:A
So I want each output line to contain a first-column ID followed by the IDs linked to it, separated by spaces, with the resulting clusters sorted by size in decreasing order.
I would like to use awk to do this.
I tried using this:
awk -F '\t' '{print $1*" "$2}'
But it gives me this output:
04S2P_1:A
05DTT_1:A
07ASS_1:A
07AUX_1:A
05HAQ_1:A
05HAP_1:A
05HAR_1:A
It adds a 0 at the beginning and doesn't keep the similar sequences on the same line.
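As far as I can tell, the 0 appears because `*` is multiplication in awk, not concatenation: `$1*" "` converts both operands to numbers, the product is 0, and that 0 is then glued in front of `$2`. Below is a rough sketch of the grouping I am after, not a definitive solution. It assumes the IDs contain no whitespace, that all rows for one first-column ID can be held in memory, and that no pair is repeated. The function name `cluster_by_size` and the temporary sample file are made up for illustration.

```shell
# Hypothetical helper: groups a two-column, whitespace-separated file by its
# first column and prints the groups largest first.
cluster_by_size() {
    awk '
    {
        # First time a key is seen: remember input order and start its line.
        if (!($1 in members)) { order[++n] = $1; members[$1] = $1; size[$1] = 1 }
        # Append the partner unless the row is the self-pair.
        if ($2 != $1) { members[$1] = members[$1] " " $2; size[$1]++ }
    }
    END {
        # Emit "size<TAB>cluster" so that sort can order by size.
        for (i = 1; i <= n; i++) print size[order[i]] "\t" members[order[i]]
    }' "$1" | sort -k1,1nr | cut -f2-
}

# Sample rows from the question, written tab-separated to a temporary file.
sample=$(mktemp)
printf '%s\t%s\n' \
    4S2P_1:A 4S2P_1:A \
    4S2P_1:A 6PXX_1:A \
    4S2P_1:A 6HB8_1:A \
    4S2P_1:A 6HOO_1:A \
    4S2P_1:A 6I5D_1:A \
    4S2R_1:A 4S2R_1:A \
    4S2C_1:A 4S2C_1:A \
    4S2C_1:A 4S2B_1:A \
    4S2E_1:A 4S2E_1:A \
    4S2E_1:A 5XB5_1:A \
    4S2E_1:A 5XBH_1:A > "$sample"

cluster_by_size "$sample"
```

On the sample rows this should print the four lines shown in the desired output above. The sorting is delegated to `sort` because plain awk has no built-in array sort; the size is prepended as a temporary first field and removed again with `cut`. How clusters of equal size are ordered relative to each other is left to `sort`.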