I am using wget to save a static copy of a PHP website. I would like to remove the .html extension from every <a href> link in every file. For example, a link such as <a href="path/project-name.html">Project Name</a> should become <a href="path/project-name">Project Name</a>.

The command `grep -rl index.html . | xargs sed -i 's/index.html//g'` works well for removing every index.html from the links.

But I can't get the same approach to work for every .html link with the command `grep -rl *.html . | xargs sed -i 's/*.html//g'`.

Any assistance with my regex would be much appreciated.

spencerthayer

1 Answer

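's/*.html//g' is wrong because *.html is a glob pattern, while the LHS (left-hand side) of the sed substitution command expects a regex. Read as a regex, the leading * is taken literally and the unescaped . matches any single character, so the pattern does not match an ordinary .html link.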
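To make the glob-versus-regex difference concrete, here is a minimal sketch that runs both patterns against a single sample line through stdin. The sample link is taken from the question. The variable names are made up for illustration.

```shell
# Sample link from the question; nothing is edited on disk here.
line='<a href="path/project-name.html">Project Name</a>'

# A leading * is literal in a POSIX basic regex, so this pattern looks for
# "*", then any one character, then "html". The line contains no "*".
unchanged=$(printf '%s\n' "$line" | sed 's/*.html//g')

# Escaping the dot yields a regex that matches the literal text ".html".
stripped=$(printf '%s\n' "$line" | sed 's/\.html//g')

printf '%s\n%s\n' "$unchanged" "$stripped"
```

The first result should be identical to the input, while the second has the extension removed. Note that the second pattern strips .html anywhere on the line, not only inside href attributes.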

You can use

grep -rl '\.html' . | xargs sed -i -E 's/(href="[^"]*)\.html"/\1"/g'

Details:

  • -E - the option that enables the POSIX ERE regex syntax
  • (href="[^"]*)\.html" - matches and captures into Group 1 (later accessed via the \1 backreference) the substring href=" followed by zero or more chars other than ", and then matches the literal substring .html" outside the group
  • \1" - replaces the match with the Group 1 value and a " char
  • g - replaces all non-overlapping occurrences on a line.
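
As a quick check of the pattern described above, the following sketch applies the same substitution to one sample line via stdin instead of editing files in place. The second link and the img tag are hypothetical additions, included only to show what is and is not touched.

```shell
# Hypothetical sample line: two links plus an attribute that must stay as-is.
sample='<a href="path/project-name.html">Project Name</a> <a href="about.html">About</a> <img src="notes.html.png">'

# Same substitution as in the answer, reading from stdin.
result=$(printf '%s\n' "$sample" | sed -E 's/(href="[^"]*)\.html"/\1"/g')

printf '%s\n' "$result"
```

Both href values lose their extension, while the src attribute is left alone because the pattern is anchored on href=" and on the closing quote. The same anchoring means links to external .html pages would be rewritten too, which may or may not be wanted.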
Wiktor Stribiżew
  • So this works and does remove the .html extension from all HREF links but for some reason, this isn't recursively going through every sub directory. – spencerthayer Jul 26 '21 at 15:21
  • @spencerthayer Try `find . -type f -name '*.html' -print0 | xargs -0 sed -i -E 's/(href="[^"]*)\.html"/\1"/g'`, see [How to do a recursive find/replace of a string with awk or sed?](https://stackoverflow.com/a/1583229/3832970). – Wiktor Stribiżew Jul 26 '21 at 15:23
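
The find-based variant can be tried safely on a throwaway tree before running it on the real wget output. This is a rough sketch: the directory layout and file names are invented for the demo, and it assumes GNU sed (BSD/macOS sed needs `-i ''` instead of `-i`).

```shell
# Build a temporary tree with one nested file; all names are hypothetical.
demo=$(mktemp -d)
mkdir -p "$demo/projects/2021"
printf '%s\n' '<a href="about.html">About</a>' > "$demo/index.html"
printf '%s\n' '<a href="../../index.html">Home</a>' > "$demo/projects/2021/site.html"

# find walks every subdirectory; -print0 and -0 keep names with spaces intact.
find "$demo" -type f -name '*.html' -print0 |
  xargs -0 sed -i -E 's/(href="[^"]*)\.html"/\1"/g'

# Read the results back, then clean up.
top=$(cat "$demo/index.html")
nested=$(cat "$demo/projects/2021/site.html")
printf '%s\n%s\n' "$top" "$nested"
rm -rf "$demo"
```

Quoting `'*.html'` matters here: it hands the pattern to find unexpanded, so the shell does not replace it with the .html files in the current directory first.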