I am trying to build a regex which will match every line which doesn't contain the word "stylesheet" in it, and has an "a href" which has a value NOT starting with http or www.

This is how far I got, but it doesn't seem to do what I want:

grep -rin "href=\"\/*\/*\/|^((?!stylesheet).)*$" *.html

The goal is that this would be caught:

<a href="/api_supplier/">
<a href="/other-internal-link/abc/">

but this wouldn't:

<a href="http://github.com/">
<a href="www.github.com/index.html">
<a href="/other-internal-link/test/" rel="stylesheet">

The ultimate goal of mine is to append "index.html" at the end of every internal link, so they would look like this:

<a href="/api_supplier/index.html">
<a href="/other-internal-link/abc/index.html">
Toto
Macskasztorik

2 Answers

This regex may do the job:

^(.*a href)((?!http|www|stylesheet).)*$
Gosfly
  • I'm not familiar with all flavors of regexes. For example grep would work on single lines anyway so I assume the `^` and `$` anchors would be redundant (and consequently the `.*` following the `^`). Do you need them e.g. in Perl? – Peter - Reinstate Monica Jan 20 '20 at 10:19
  • This doesn't work with: `` – Toto Jan 20 '20 at 10:26
  • I think you are right about the usefulness of ^ and $ when using grep but the .* would still be mandatory in my opinion. It's here to catch anything before "a href" which may be nothing else than a simple "<". Unfortunately, I don't know anything about Perl so I couldn't help you here but they are still useful in many situations when you handle multiline strings for example when you are looking for a specific pattern per line. – Gosfly Jan 20 '20 at 10:28
  • @Toto Hmm, what do you mean? He doesn't want lines with stylesheet in them, so it will be ignored as expected, right? – Gosfly Jan 20 '20 at 10:30
  • No, it will not. You are testing for the absence of `stylesheet` **only after** `href` – Toto Jan 20 '20 at 10:33
  • @Peter-ReinstateMonica Edit: I think ^ and $ are always needed in this case, even when using grep, which works on single lines, because we want to match from the beginning to the end of the line and we need to say so in the regex. Maybe someone could confirm, because I am not 100% sure. – Gosfly Jan 20 '20 at 10:35
  • @Toto only after "a href" which makes a huge difference – Gosfly Jan 20 '20 at 10:36
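
The disagreement in the comments above can be checked directly. The following is a minimal sketch in Python (not the grep or Perl used in the thread) that runs the answer's regex, unchanged, against the question's sample lines plus one hypothetical line in which "stylesheet" appears before the `a href`. The `matches` helper and the last sample line are assumptions added for illustration.

```python
import re

# The regex from the answer above, copied unchanged.
PATTERN = re.compile(r'^(.*a href)((?!http|www|stylesheet).)*$')


def matches(line: str) -> bool:
    """Return True if the answer's regex matches the given line."""
    return PATTERN.search(line) is not None


samples = [
    '<a href="/api_supplier/">',
    '<a href="http://github.com/">',
    '<a href="/other-internal-link/test/" rel="stylesheet">',
    # Hypothetical line: "stylesheet" occurs BEFORE the "a href".
    '<link rel="stylesheet"><a href="/x/">',
]

for line in samples:
    print(matches(line), line)
```

Because the negative lookahead is only applied to the characters after `a href`, the last line is still matched even though it contains "stylesheet". Anchoring the check at the start of the line, as in `^(?!.*stylesheet)`, would cover the whole line.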

A Perl way that appends index.html to the right URLs:

~cat file.txt 
<a href="/api_supplier/">
<a href="/other-internal-link/abc/">

<a href="http://github.com/">
<a href="www.github.com/index.html">
<a href="/other-internal-link/test/" rel="stylesheet">

~perl -ape 's~^(?!.*stylesheet).*?\bhref="/[^"]+\K~index.html~' file.txt 
<a href="/api_supplier/index.html">
<a href="/other-internal-link/abc/index.html">

<a href="http://github.com/">
<a href="www.github.com/index.html">
<a href="/other-internal-link/test/" rel="stylesheet">

If you want to do the replacement in-place, use the -i option:

perl -i -ape 's~^(?!.*stylesheet).*?\bhref="/[^"]+\K~index.html~' file.txt
Toto
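
For readers without Perl, the same substitution can be sketched in Python. This is an illustrative port, not part of the answer above: Python's `re` module does not support `\K`, so the prefix is captured in a group and written back instead. The names `INTERNAL_HREF` and `append_index` are hypothetical.

```python
import re

# Same idea as the Perl one-liner: skip lines containing "stylesheet",
# then find the first href whose value starts with "/".
# Python's re has no \K, so the matched prefix is captured and re-emitted.
INTERNAL_HREF = re.compile(r'^(?!.*stylesheet)(.*?\bhref="/[^"]+)')


def append_index(line: str) -> str:
    """Append index.html to the first internal href on the line, if any."""
    return INTERNAL_HREF.sub(r'\g<1>index.html', line, count=1)


lines = [
    '<a href="/api_supplier/">',
    '<a href="/other-internal-link/abc/">',
    '<a href="http://github.com/">',
    '<a href="www.github.com/index.html">',
    '<a href="/other-internal-link/test/" rel="stylesheet">',
]

for line in lines:
    print(append_index(line))
```

As with the Perl version, this assumes every internal link ends in a slash; a value such as `/page.html` would also get `index.html` appended, and a protocol-relative URL starting with `//` would be treated as internal.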