
I'd like to remove certain hyperlinks, all of which contain "legacy/" in the URL, from many HTML files. Some of them fit on a single line

<a href=".../legacy/..."> ... </a>\n

while others span several lines. How can I use sed to replace them all at once?

So far I've tried

sed -ri 's/(.+legacy\/[[:print:]]+<\/a>.*$)/<!--\1-->/g' wave-on-a-string.html 

which only replaces a hyperlink that sits on a single line. I then realized that sed reads only one line at a time, but I couldn't find out how to match a hyperlink block that spans multiple (an unknown number of) lines.

The HTML files have some contents like this:

      <a class="other-sim-page" href="legacy/wave-on-a-string.html" dir="ltr">
        <table>
          <tr>
            <td>
              <img style="display: block;" src="../../images/icons/sim-badges/flash-badge.png" alt="Flash Logo" width="44" height="44">
            </td>
            <td>
              <span class="other-sim-link">原始模擬教學與翻譯</span>
            </td>
          </tr>
        </table>
      </a>

...

          <p>瀏覽<a href="legacy/wave-on-a-string.html#for-teachers-header">更多活動</a>。</p>

...

                    <a href="legacy/radiating-charge.html" class="simulation-link">

                      <img class="simulation-list-thumbnail" src="../../sims/radiating-charge/radiating-charge-128.png" id="simulation-display-thumbnail-radiating-charge" alt="Screenshot of the simulation 電荷輻射" width="128" height="84"/><br/>
                        <strong><span class="simulation-list-title">電荷輻射</span></strong><br/>
                        <span class="sim-display-badge sim-badge-flash"></span>
                    </a>

...

and my command only matches and replaces the second hyperlink, since that one is on a single line.

I'd like to replace all the hyperlink blocks (<a href="..."> ... </a>), including those that stretch over several lines.

Allan
Franklin

3 Answers


With GNU sed for -z and using all 3 blocks of input you provided together in one file as input:

$ sed -z '
    s:@:@A:g; s:}:@B:g; s:</a>:}:g;
    s:<a[^<>]* href="legacy/[^}]*}:<!--&-->:g;
    s:}:</a>:g; s:@B:}:g; s:@A:@:g
' file
      <!--<a class="other-sim-page" href="legacy/wave-on-a-string.html" dir="ltr">
        <table>
          <tr>
            <td>
              <img style="display: block;" src="../../images/icons/sim-badges/flash-badge.png" alt="Flash Logo" width="44" height="44">
            </td>
            <td>
              <span class="other-sim-link">原始模擬教學與翻譯</span>
            </td>
          </tr>
        </table>
      </a>-->

...

          <p>瀏覽<!--<a href="legacy/wave-on-a-string.html#for-teachers-header">更多活動</a>-->。</p>

...

                    <!--<a href="legacy/radiating-charge.html" class="simulation-link">

                      <img class="simulation-list-thumbnail" src="../../sims/radiating-charge/radiating-charge-128.png" id="simulation-display-thumbnail-radiating-charge" alt="Screenshot of the simulation 電荷輻射" width="128" height="84"/><br/>
                        <strong><span class="simulation-list-title">電荷輻射</span></strong><br/>
                        <span class="sim-display-badge sim-badge-flash"></span>
                    </a>-->

The first line makes } a character that cannot occur in the input: it converts every @ to @A (so that @B is guaranteed not to appear), then every } to @B, and finally every </a> to }. That lets the closing tag be negated in a bracket expression as [^}] in the regexp for the string you want to replace. The second line does the actual replacement, and the third line undoes the first in reverse order: it restores all }s to </a>s, then @Bs to }s, and then @As to @s.
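To watch the three stages work, here is a minimal sketch that runs each line of the script as a separate step on a small made-up sample. The function names encode, wrap and decode and the sample text are mine, not part of the answer, and GNU sed is assumed for -z.

```shell
# Stage 1: free up "}" (via @A/@B) and then make it stand for "</a>".
encode() { sed -z 's:@:@A:g; s:}:@B:g; s:</a>:}:g'; }

# Stage 2: wrap each legacy link in a comment. [^}]* can cross newlines
# (because of -z) but cannot run past the end of a link.
wrap() { sed -z 's:<a[^<>]* href="legacy/[^}]*}:<!--&-->:g'; }

# Stage 3: undo stage 1 in reverse order.
decode() { sed -z 's:}:</a>:g; s:@B:}:g; s:@A:@:g'; }

# Made-up sample: one multi-line legacy link and one link to keep.
sample='<a href="legacy/x.html">
 a@b {c}
</a> <a href="y.html">keep</a>
'

printf '%s' "$sample" | encode                  # intermediate text
printf '%s' "$sample" | encode | wrap | decode  # final result
```

The first pipeline prints the intermediate text, in which } marks the end of each link and the original } has become @B. The second prints the final result, where only the legacy link is wrapped in a comment and the @, { and } characters come back unchanged.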

Manipulating the input to create a char that can't exist in the input is a fairly common sed idiom to work around not being able to negate strings in regexps. See https://stackoverflow.com/a/35708616/1745001 for another example with additional explanation.

This will of course fail if your input contains strings that resemble the ones you're trying to match, but in practice it's probably good enough for your specific input. You'll just have to think about what it does and check its output to verify.
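One concrete case of such a look-alike string is a legacy link that is already commented out, for instance because the script has been run on the same file before. The sketch below (the wrapper name comment_legacy and the one-line sample are mine; GNU sed is assumed) runs the script twice and shows that the second pass wraps the link again. HTML comments do not nest, so the result is no longer a clean comment.

```shell
# The sed script from above, wrapped in a function that filters stdin.
comment_legacy() {
    sed -z '
        s:@:@A:g; s:}:@B:g; s:</a>:}:g;
        s:<a[^<>]* href="legacy/[^}]*}:<!--&-->:g;
        s:}:</a>:g; s:@B:}:g; s:@A:@:g
    '
}

# Made-up one-line sample.
once=$(printf '%s\n' '<p><a href="legacy/x.html">old</a></p>' | comment_legacy)
twice=$(printf '%s\n' "$once" | comment_legacy)

printf '%s\n' "$once"    # <p><!--<a href="legacy/x.html">old</a>--></p>
printf '%s\n' "$twice"   # <p><!--<!--<a href="legacy/x.html">old</a>-->--></p>
```

If the files may be processed more than once, it is worth checking for an existing `<!--<a` before running the command, or keeping a backup of the originals.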

Ed Morton

You are not using the proper tool for this task.

sed is a great tool for find-and-replace with regexes. However, regexes (which are based on DFAs) cannot parse nested structures such as JSON or XML trees, because there is no limit to the depth of the nesting. I would therefore recommend using an XML/HTML parser.
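A small sketch of that limitation, assuming GNU sed for -z: the hypothetical helper strip_div below tries to delete a <div> element with a regex by matching from <div> up to the first </div> that follows. On nested input it stops at the inner closing tag and leaves a fragment of the outer element behind.

```shell
# Hypothetical helper: delete each <div> ... </div> block with a regex.
# "}" is temporarily made to stand for "</div>" so that [^}]* stops at
# the first closing tag.
strip_div() {
    sed -z '
        s:@:@A:g; s:}:@B:g; s:</div>:}:g;
        s:<div>[^}]*}::g;
        s:}:</div>:g; s:@B:}:g; s:@A:@:g
    '
}

# Flat input: the element is removed cleanly.
printf '%s\n' '<div>x</div>y' | strip_div
# prints: y

# Nested input: the match ends at the inner </div>.
printf '%s\n' '<div>outer <div>inner</div> tail</div>' | strip_div
# prints:  tail</div>
```

A parser does not have this problem because it tracks which closing tag belongs to which opening tag.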

For example you can use XSLT:

Input:

$ cat webpage.html 
<!DOCTYPE html>
<html>
    <body>
        <h1>My First Heading</h1>
        <p>My first paragraph.</p>
        <a href="https://www.w3schools.com">Visit W3Schools</a>
                <p>My second paragraph.</p>
        <a href="legacy/radiating-charge.html" class="simulation-link">
            <img class="simulation-list-thumbnail" src="../../sims/radiating-charge/radiating-charge-128.png" id="simulation-display-thumbnail-radiating-charge" alt="Screenshot of the simulation 電荷輻射" width="128" height="84"/><br/>
            <strong><span class="simulation-list-title">電荷輻射</span></strong><br/>
            <span class="sim-display-badge sim-badge-flash"></span>
        </a>
    </body>
</html>

Stylesheet:

$ cat remove_legacy.xslt 
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

   <xsl:output method="html" encoding="UTF-8" omit-xml-declaration="yes"/>

   <!-- copy the whole structure recursively -->
    <xsl:template match="@*|node()">
       <xsl:copy>
          <xsl:apply-templates select="@*|node()"/>
       </xsl:copy>
    </xsl:template>

   <!-- match every a element whose href contains 'legacy' -->
   <xsl:template match="//a[contains(@href,'legacy')]">
     <!-- add comment starting tag -->
     <xsl:text disable-output-escaping="yes">&#xa;&lt;!--&#xa;</xsl:text>
       <xsl:copy>
          <xsl:apply-templates select="@*|node()"/>
       </xsl:copy>
     <!-- add comment ending tag -->
     <xsl:text disable-output-escaping="yes">&#xa;--&gt;&#xa;</xsl:text> 
   </xsl:template>

</xsl:stylesheet>

Output:

$ xsltproc --html remove_legacy.xslt webpage.html 
<html>
    <body>
        <h1>My First Heading</h1>
        <p>My first paragraph.</p>
        <a href="https://www.w3schools.com">Visit W3Schools</a>
                <p>My second paragraph.</p>

<!--
<a href="legacy/radiating-charge.html" class="simulation-link">
            <img class="simulation-list-thumbnail" src="../../sims/radiating-charge/radiating-charge-128.png" id="simulation-display-thumbnail-radiating-charge" alt="Screenshot of the simulation 電荷輻射" width="128" height="84"><br>
            <strong><span class="simulation-list-title">電荷輻射</span></strong><br>
            <span class="sim-display-badge sim-badge-flash"></span>
        </a>
-->

    </body>
</html>

As you can see, the link whose href does not contain legacy is left uncommented.

Allan
  • Thanks for providing this answer. I tried it but got many errors. I guess that maybe the original HTML files are not strictly structured. – Franklin Apr 09 '19 at 05:51
  • @Franklin: could you run the command `xsltproc --html remove_legacy.xslt webpage.html` with `--html`? – Allan Apr 09 '19 at 05:55
  • @Franklin: if this does not work either, then it is because your HTML files are broken... – Allan Apr 09 '19 at 05:55
  • Right, the problem must be in the original HTML file. https://pastebin.com/AqjxmDJT The first error in line 7 contains only . Line 138 is a line of JavaScript code and caused a lot of errors. Also it has which should work for Google Classroom I guess. Thanks for sharing the tool anyway. – Franklin Apr 09 '19 at 06:33

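Try GNU sed. The first command deletes every line that contains a complete legacy link; the second deletes each range of lines from an opening legacy <a ...> tag through the next line containing </a>. Note that d removes whole lines, so any other text on those lines is removed as well: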

sed -E '/<a\s+.*href=.*legacy\/.*<\/a>/d; /<a\s+.*href=.*legacy\//,/<\/a>/d'  wave-on-a-string.html
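To check what this command does before running it on real files, here is a small sketch that feeds it a made-up five-line sample through stdin (the wrapper name delete_legacy and the sample lines are mine; GNU sed is assumed for \s).

```shell
# The command from above, wrapped in a function that filters stdin.
delete_legacy() {
    sed -E '/<a\s+.*href=.*legacy\/.*<\/a>/d; /<a\s+.*href=.*legacy\//,/<\/a>/d'
}

# Made-up sample: a one-line legacy link inside a paragraph, a legacy
# link spread over three lines, and a link that should be kept.
printf '%s\n' \
    '<p>See <a href="legacy/x.html">more</a>.</p>' \
    '<a href="legacy/y.html">' \
    '  <img src="y.png">' \
    '</a>' \
    '<a href="z.html">keep</a>' | delete_legacy
# prints: <a href="z.html">keep</a>
```

Only the last line survives. The whole <p> line is gone, not just the link inside it, so this approach fits best when legacy links sit on lines of their own. The one-line rule comes first so that a complete link on a single line is deleted before it can open a range.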