Replacing everything between two strings in UNIX shell

Question

I want to write a shell script that fetches all "a href" HTML tags from a given link and prints them to the console. The problem I am facing right now is removing all of the text I don't need between them. After some googling I concluded that the "sed" command would be the best tool for this job; however, I cannot figure out how to write it correctly.

#!/bin/sh
wget -qO - $1 | grep -E "*<[Aa]([[:print:]])*( |'\n')[Hh][Rr][Ee][Ff]([[:print:]])*</a" | sed 's/<\/a>.*<a/<\/a>REPLACED\n<a/g'

What I am trying to do is replace EVERYTHING between the "</a>" closing tag and the next "<a" opening tag. (I don't know much about HTML; there may be other tags that open with "<a" and close with "</a>", but that's a problem for later.) However, this command, and a few variations I have tried, only works sometimes.

I am new to shell scripting, so any suggestions are welcome. Maybe "sed" is not the right command for the job. I hope you can help me; thanks in advance.

Edit 1: from this:

<a href="http://www.canonical.com">Canonical</a></li></ul></li></ul></div></div> <script> $(function() { $(".nav-global .more > a").click(function(e){ $(this).closest(".more").toggleClass("open"); return false; }); $(document).click(function(){ $(".nav-global .more.open").removeClass("open"); }); }); </script></div>
            <a href="#" class="s-topbar--menu-btn js-left-sidebar-toggle" role="menuitem" aria-haspopup="true" aria-controls="left-sidebar" aria-expanded="false"><span></span></a>

to this:

<a href="http://www.canonical.com">Canonical</a>REPLACED<a href="#" class="s-topbar--menu-btn js-left-sidebar-toggle" role="menuitem" aria-haspopup="true" aria-controls="left-sidebar" aria-expanded="false"><span></span></a>

Edit 2: It seems I did not explain clearly what I expect. For large-scale testing, I use the link https://askubuntu.com/questions/726076/whats-wrong-with-my-grep-command. What I am trying to achieve is output that contains ONLY the "a href" tags (or other HTML elements that start with "<a" and end with "</a>"), separated by "REPLACED" as shown in the previous edit.

It is one more funny bet I make (and win) each time I stumble upon a question about matching a string between two other strings, or matching multiple lines: Someone is trying to force the wrong tool to parse some markup language, like html using Bash, Sed, Awk, Regexe... — Léa Gris, Sep 09 '22 at 15:40
@LéaGris Indeed. However, none of the answers in the linked duplicate address this question. Perhaps search for a better duplicate? — HatLess, Sep 09 '22 at 18:17

RavinderSingh13 · Accepted Answer · 2022-09-09T15:39:26.270

score 4

1st solution: With your shown samples, please try the following awk code. Written and tested in GNU awk.

awk -v RS="" -v FS='<\\/a>.*<a href=' '{print $1"</a>REPLACED<a href="$2}' Input_file

2nd solution: Using RS and the sub function of awk; written and tested in GNU awk.

awk -v RS="" '{sub(/<\/a>.*<a href=/,"</a>REPLACED<a href=")} 1' Input_file
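To see why paragraph mode matters here, the following is a minimal sketch that runs the 2nd solution against a shortened copy of the question's sample. The temporary file and the trimmed HTML are assumptions made for illustration. With RS="" awk reads everything up to the next blank line as one record, so the closing tag and the next opening tag land in the same record even though they sit on different lines.

```shell
#!/bin/sh
# Hypothetical sample file reproducing the two-line shape from the question.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<a href="http://www.canonical.com">Canonical</a></li></ul></div> <script>x();</script></div>
            <a href="#" class="s-topbar--menu-btn"><span></span></a>
EOF

# RS="" switches awk to paragraph mode: both lines form a single record,
# so ".*" can match across the newline between them.
result=$(awk -v RS="" '{sub(/<\/a>.*<a href=/,"</a>REPLACED<a href=")} 1' "$tmp")
rm -f "$tmp"

printf '%s\n' "$result"
```

If the input contains blank lines, awk splits it into several records and the substitution can no longer reach across them.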

edited Sep 09 '22 at 15:39

answered Sep 09 '22 at 15:18

RavinderSingh13


score 2 · Answer 2 · answered Sep 09 '22 at 15:03

Using sed

$ sed -Ez 's~(<[^<]*)[^\n]*\n +~\1</a>REPLACED~' input_file
<a href="http://www.canonical.com">Canonical</a>REPLACED<a href="#" class="s-topbar--menu-btn js-left-sidebar-toggle" role="menuitem" aria-haspopup="true" aria-controls="left-sidebar" aria-expanded="false"><span></span></a>
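The -z option is what lets this pattern cross the line break, and it also suggests why the command in the question "only works sometimes". By default sed handles one line at a time, so a pattern such as `</a>.*<a` can only match when both tags share a line. The sketch below contrasts the two modes on a hypothetical two-line input; it assumes GNU sed, since -z is a GNU extension.

```shell
#!/bin/sh
# Hypothetical two-line input: the closing tag and the next opening tag
# sit on different lines, as in the sample from the question.
input='<a href="/x">X</a></li>
   <a href="/y">Y</a>'

# Default mode: sed sees one line at a time, so the pattern never matches
# and the text comes back unchanged.
per_line=$(printf '%s\n' "$input" | sed 's/<\/a>.*<a/<\/a>REPLACED<a/')

# GNU sed -z: records end at NUL instead of newline, so the whole input
# is one record and ".*" can run across the line break.
whole=$(printf '%s\n' "$input" | sed -z 's/<\/a>.*<a/<\/a>REPLACED<a/')

printf 'per line: %s\n' "$per_line"
printf 'with -z:  %s\n' "$whole"
```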

score 0 · Answer 3 · answered Sep 10 '22 at 14:25

Output result to stdout:

sed -z 's/\(<\/a>\).*\(<a\)/\1REPLACED\2/g' inputfile
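One caveat: `.*` is greedy, so with -z this matches from the first "</a>" to the last "<a" in the whole input. On a page with more than two links, every link in between is swallowed. For the goal in Edit 2 (only the anchors, separated by "REPLACED"), one possible approach is to pull out each anchor separately. The sketch below does that with POSIX awk on a hypothetical three-link input; it is a text heuristic rather than an HTML parser, and it assumes lowercase "</a>" closing tags and no nested anchors.

```shell
#!/bin/sh
# Hypothetical three-link input; a real page would come from wget.
html='<p><a href="/one">One</a> text <b>bold</b>
<a href="/two"><span>Two</span></a></p><script>x()</script>
<a href="/three">Three</a> tail'

# tr folds newlines into spaces so a tag split across lines still matches.
# awk then walks the text, cutting out each "<a ...> ... </a>" span.
links=$(printf '%s\n' "$html" | tr '\n' ' ' | awk '
{
  out = ""
  rest = $0
  # match() sets RSTART to the leftmost "<a ...>" opening tag
  while (match(rest, /<[Aa][ \t][^>]*>/)) {
    rest = substr(rest, RSTART)
    stop = index(rest, "</a>")      # first closing tag after that opening tag
    if (stop == 0) break            # unterminated anchor: give up
    tag = substr(rest, 1, stop + 3) # include all four characters of the closing tag
    out = (out == "" ? tag : out "REPLACED" tag)
    rest = substr(rest, stop + 4)
  }
  print out
}')

printf '%s\n' "$links"
```

In a script like the one in the question, the `printf` feeding the pipeline would be replaced by `wget -qO - "$1"`.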

answered Sep 10 '22 at 14:25

wangloo
