I want to write a shell script that fetches the page at a given link, extracts all of its "a href" HTML tags, and prints them to the console. The problem I am facing right now is removing all of the text I don't need between them. After some googling I concluded that the "sed" command would be the best tool for this job; however, I cannot figure out how to write it correctly:

#!/bin/sh
wget -qO - $1 | grep -E "*<[Aa]([[:print:]])*( |'\n')[Hh][Rr][Ee][Ff]([[:print:]])*</a" | sed 's/<\/a>.*<a/<\/a>REPLACED\n<a/g'

What I am trying to do is replace EVERYTHING between a "</a>" closing tag and the next "<a" opening tag. (I don't know much about HTML, and there may be other tags that also start with "<a", but that's a problem for later.) However, this command (and a few variations I have tried) only works sometimes.

I am new to shell scripting, so any suggestions are welcome. Maybe "sed" is not the right command for the job. I hope you can help. Thanks in advance.

Edit 1: from this:

<a href="http://www.canonical.com">Canonical</a></li></ul></li></ul></div></div> <script> $(function() { $(".nav-global .more > a").click(function(e){ $(this).closest(".more").toggleClass("open"); return false; }); $(document).click(function(){ $(".nav-global .more.open").removeClass("open"); }); }); </script></div>
            <a href="#" class="s-topbar--menu-btn js-left-sidebar-toggle" role="menuitem" aria-haspopup="true" aria-controls="left-sidebar" aria-expanded="false"><span></span></a>

to this:

<a href="http://www.canonical.com">Canonical</a>REPLACED<a href="#" class="s-topbar--menu-btn js-left-sidebar-toggle" role="menuitem" aria-haspopup="true" aria-controls="left-sidebar" aria-expanded="false"><span></span></a>

Edit 2: It seems I explained my expectations poorly. For large-scale testing, I use the link https://askubuntu.com/questions/726076/whats-wrong-with-my-grep-command. What I am trying to achieve is output that contains ONLY the "a href" elements (or other HTML elements that start with "<a" and end with "</a>"), separated by "REPLACED" as shown in the previous edit.

RavinderSingh13
  • Please provide sample data and expected output. – HatLess Sep 09 '22 at 14:43
  • It is one more funny bet I make (and win) each time I stumble upon a question about matching a string between two other strings, or matching multiple lines: someone is trying to force the wrong tool to parse some markup language, like HTML using Bash, sed, awk, regex... – Léa Gris Sep 09 '22 at 15:40
  • @LéaGris Indeed. However, none of the answers in the linked duplicate address this question. Perhaps search for a better duplicate? – HatLess Sep 09 '22 at 18:17

3 Answers

1st solution: With your shown samples, please try the following awk code. Written and tested in GNU awk.

awk -v RS="" -v FS='<\\/a>.*<a href=' '{print $1"</a>REPLACED<a href="$2}' Input_file

2nd solution: Using RS and the sub function of awk; written and tested in GNU awk.

awk -v RS="" '{sub(/<\/a>.*<a href=/,"</a>REPLACED<a href=")} 1' Input_file
RavinderSingh13
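
One caveat that may be worth checking before running the awk commands above on a whole page: ".*" is greedy, so when the input holds more than two links, the match runs from the first "</a>" to the last "<a href=" and the links in between are dropped. The sketch below reproduces this with a made-up three-link line; the function name greedy_demo and the sample input are illustrative assumptions, not part of the original answer.

```shell
#!/bin/sh
# Hypothetical demo: feed three links on one line to the 2nd awk solution.
greedy_demo() {
  printf '%s\n' '<a href="1">one</a> x <a href="2">two</a> y <a href="3">three</a>' |
    awk -v RS="" '{sub(/<\/a>.*<a href=/,"</a>REPLACED<a href=")} 1'
}

# The middle link is swallowed by the greedy match:
greedy_demo   # prints <a href="1">one</a>REPLACED<a href="3">three</a>
```

For the two-link sample in the question this does not matter, which is why the commands produce the expected output there.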

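Using GNU sed (-z is a GNU extension that separates records by NUL characters, so the whole file is processed as one record):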

$ sed -Ez 's~(<[^<]*)[^\n]*\n +~\1</a>REPLACED~' input_file
<a href="http://www.canonical.com">Canonical</a>REPLACED<a href="#" class="s-topbar--menu-btn js-left-sidebar-toggle" role="menuitem" aria-haspopup="true" aria-controls="left-sidebar" aria-expanded="false"><span></span></a>
HatLess

This prints the result to stdout (the -z option requires GNU sed):

sed -z 's/\(<\/a>\).*\(<a\)/\1REPLACED\2/g' inputfile
wangloo
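
For the whole-page case described in Edit 2 (any number of links, only the "<a ...>...</a>" elements kept, joined by "REPLACED"), a minimal sketch in plain awk could look like the following. It is not taken from any of the answers above; the function name extract_anchors is made up for illustration. It assumes that anchors are not nested and that every anchor to be kept has a closing "</a>". It is still text matching rather than real HTML parsing, so unusual markup (for example a ">" inside an attribute value) can confuse it.

```shell
#!/bin/sh
# extract_anchors (hypothetical name): read HTML on stdin and print every
# <a ...>...</a> element on one line, joined by the literal text REPLACED.
extract_anchors() {
  awk '
    # Join all input lines into one buffer so anchors may span lines.
    { buf = buf $0 " " }
    END {
      out = ""
      # Find the next opening tag: "<a" or "<A" followed by a blank.
      while (match(buf, /<[Aa][ \t][^>]*>/)) {
        rest = substr(buf, RSTART)
        # Nearest closing tag, matched case-insensitively.
        stop = index(tolower(rest), "</a>")
        if (stop == 0) break              # unclosed anchor: give up
        tag = substr(rest, 1, stop + 3)   # keep through the end of "</a>"
        out = (out == "") ? tag : (out "REPLACED" tag)
        buf = substr(rest, stop + 4)      # continue after this anchor
      }
      print out
    }'
}

# Usage inside the script from the question (URL passed as $1):
#   wget -qO - "$1" | extract_anchors
```

Taking the nearest "</a>" after each opening tag sidesteps the greedy ".*" problem, and collecting the input into one buffer removes the dependence on how the page happens to be split into lines.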