
I have a huge table in which I am trying to rename some duplicated column names using sed's first-match replacement. For that, I am using an array of the duplicated column names, which I selected manually.

I first tried the sed code with one simple text string, and it worked:

sed '0,/AGE_032/ s//AGE_032.old/' combined.order.allfilter.abund.tsv | head -n1

Then I tried to replace the match using a single element of the array, and it does not work:

declare -a oldarr=("AGE_032" "MOLI_032" "OIA_013" "SH-108" "SH-16")
sed '0,/${oldarr[0]}/ s//${oldarr[0]}.old/' combined.order.allfilter.abund.tsv | head -n1

The expected output should be something like this:

AGE_023 AGE_024 AGE_025 AGE_026 AGE_027 AGE_028 AGE_029 AGE_030 AGE_031 
AGE_032.old MOLI_029 MOLI_030 MOLI_031 MOLI_032 MOLI_033 SH-107  OIA_013 
SH-108 SH-109 SH-110 SH-13 SH-15 SH-16 SH-17 AREN_36 AREN_38 AREN_39 
AGE_032 MOLI_032 OIA_013 SH-108 SH-16

Note that AGE_032, MOLI_032, OIA_013, SH-108 and SH-16 appear twice, and only the first match of AGE_032 should be replaced with AGE_032.old.

Of course, any other code solution for solving the problem will be appreciated.

Clarification: the code must work for replacing the first match of every string inside the array.

ALG
  • Possible duplicate of [Difference between single and double quotes in Bash](https://stackoverflow.com/questions/6697753/difference-between-single-and-double-quotes-in-bash) – KamilCuk Nov 12 '19 at 15:29
  • Is it always first match of only string `AGE_032` or for first match for all strings you want to substitute? Kindly clarify the same. – RavinderSingh13 Nov 12 '19 at 15:29
  • Using `sed` for this sounds ... awkward. How did you extract the array in the first place? This sounds a lot like a job for Awk. – tripleee Nov 12 '19 at 15:29
  • This is for the first match of all strings inside the array. I used sed because it came to my mind, but if awk provides a better solution it would be great to know. – ALG Nov 12 '19 at 15:31
  • `awk '{ for (i=1; i<=NF; ++i) if($0 ~ "(^| )" $i " .* " $i "( |$)" && a[$i]++==0) $i = $i ".old" } 1'` replaces the first one if there are two of any token. It needs rework if you can have more than one duplicate or need matching to not straddle lines. – tripleee Nov 12 '19 at 16:32
  • Of course, `perl -pe 's/\b(\w+)\b(?=.*\b\1\b)/$1.old/g'` is even more succinct, and handles all duplicates on a line if there are several, with no bleed between lines. – tripleee Nov 12 '19 at 16:58
  • But do you mean that the code should only modify the first line? That's easy per se, but a significant additional requirement. – tripleee Nov 12 '19 at 17:00

1 Answer


Pulling the columns out into a Bash array seems like a very roundabout way of doing this. A simple Awk or Perl script can examine the column headers and write them out in one go. Here's a Perl one-liner to rename headers on the first line and write the result back to the original file name:

perl -i~ -pe 's/\b(\w+[-_]\d+)\b(?=.*\b\1\b)/$1.old/g if $.==1' combined.order.allfilter.abund.tsv

The regular expression successively finds tokens which occur at least twice on the first line of the file, and replaces every occurrence except the last with the original token plus ".old".

In more detail, the regular expression looks for a word boundary (\b) before and after a label matching \w+[-_]\d+, that is, a name, then a hyphen or underscore, then a number, which covers both SH-108 and AGE_032. (A pattern with only a literal hyphen, \w+-\d+, would skip the underscore-separated labels.) The parentheses capture this label, and a lookahead (?=...) checks whether it occurs again between similar separators further to the right; the \1 matches the captured string again.

The postfix condition if $.==1 limits this to the first line of the file.

The option -i~ will create a backup file with a tilde appended to its name; once you are confident that this works, you can take it out if you don't want a backup file to be written.
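If you would rather keep the array-based sed approach from the question, the immediate problem there is quoting: inside single quotes the shell does not expand ${oldarr[0]}, so sed receives that text literally. A rough sketch of one way around it, assuming Bash and GNU sed (the 0,/re/ address is a GNU extension) and using a made-up, space-separated header, is to build one double-quoted expression per array element:

```shell
#!/usr/bin/env bash
# Hypothetical sample data: a header with duplicated names, then one data row.
header='AGE_031 AGE_032 MOLI_032 SH-108 SH-16 OIA_013 AGE_032 MOLI_032 OIA_013 SH-108 SH-16'

declare -a oldarr=("AGE_032" "MOLI_032" "OIA_013" "SH-108" "SH-16")

# One -e expression per name. Double quotes let the shell expand $name
# before sed sees the script; single quotes would pass it through literally.
sed_args=()
for name in "${oldarr[@]}"; do
  sed_args+=(-e "0,/$name/ s/$name/$name.old/")
done

# Without the g flag, s/// changes only the first match on the line.
renamed=$(printf '%s\n' "$header" '1 2 3' | sed "${sed_args[@]}")
printf '%s\n' "$renamed"
```

The names are used as regular expressions without any escaping or anchoring, so this only behaves as intended while they contain no regex metacharacters and no name is a prefix of another column (SH-16 would also match inside SH-160, for example).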

tripleee