How could I append 'index.html' to all links in an HTML file that do not already end with it?
So that, for example, href="http://mysite/" would become href="http://mysite/index.html".
I am not a sed expert, but I think this works:
sed -e "s_\"\(http://[^\"]*\)/index.html\"_\"\1\"_g" \
-e "s_\"\(http://[^\"]*[^/]\)/*\"_\"\1/index.html\"_g"
The first replacement finds URLs already ending in /index.html and deletes this ending.
The second replacement adds /index.html as required. It deals with URLs that end in / and also those that don't.
More than one version of sed exists. I'm using the one that comes with Xcode for OS X.
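To try the two expressions on a few sample lines, here is a minimal sketch. The function name add_index and the sample hrefs are only illustrative; the sed expressions are the same ones as above, written with single quotes so the inner double quotes need no escaping.

```shell
#!/bin/sh
# add_index is a hypothetical helper name; it reads HTML on stdin
# and applies the two sed expressions from the answer above.
add_index() {
    # 1st expression: strip an existing /index.html so it is not added twice.
    # 2nd expression: append /index.html, swallowing any trailing slashes.
    sed -e 's_"\(http://[^"]*\)/index.html"_"\1"_g' \
        -e 's_"\(http://[^"]*[^/]\)/*"_"\1/index.html"_g'
}

printf '%s\n' \
    'href="http://mysite/"' \
    'href="http://mysite"' \
    'href="http://mysite/docs/index.html"' | add_index
# Prints:
# href="http://mysite/index.html"
# href="http://mysite/index.html"
# href="http://mysite/docs/index.html"
```

Note that this treats every quoted http:// URL as a directory, so a link such as "http://mysite/page.html" would also get /index.html appended.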
What about this:
echo 'href="http://mysite/"' | awk '/http/ {sub(/\/\"/,"/index.html\"")}1'
href="http://mysite/index.html"
echo 'href="http://www.google.com/"' | awk '/http/ {sub(/\/\"/,"/index.html\"")}1'
href="http://www.google.com/index.html"
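Since sub only replaces the first match on a line, a line holding several links would be only partly converted. A possible variant using gsub is sketched below; the function name append_index_awk and the example hosts are made up for illustration, and it assumes every /" on a line containing http marks the end of a link that should get index.html.

```shell
#!/bin/sh
# append_index_awk is a hypothetical helper name; it reads HTML on stdin.
append_index_awk() {
    # gsub replaces every occurrence of /" on the line, not just the first.
    awk '/http/ { gsub(/\/"/, "/index.html\"") } 1'
}

printf '%s\n' '<a href="http://a.example/">A</a> <a href="http://b.example/">B</a>' |
    append_index_awk
# Prints:
# <a href="http://a.example/index.html">A</a> <a href="http://b.example/index.html">B</a>
```

Like the original command, this only touches links that end in /.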
For href values ending with /:
sed 's|\(href="http://[^"]*/\)"|\1index.html"|g' YourFile
If there are folder references without a trailing /, you need to specify what counts as a file and what does not (for example, a last path segment with a dot in it is a file).
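One way to make that file-or-folder rule concrete is sketched below. It is only an illustration under an assumed rule: a last path segment containing a dot is a file and is left alone, anything else is a folder. The function name add_index_dirs is hypothetical.

```shell
#!/bin/sh
# add_index_dirs is a hypothetical helper name; it reads HTML on stdin.
# Assumed rule: a last path segment with a dot in it is a file, otherwise a folder.
add_index_dirs() {
    # 1st expression: a bare host ("http://host") gets a trailing slash.
    # 2nd expression: a last segment without a dot gets a trailing slash.
    # 3rd expression: every URL now ending in / gets index.html appended.
    sed -e 's_"\(http://[^/"]*\)"_"\1/"_g' \
        -e 's_"\(http://[^"]*/[^/."]\{1,\}\)"_"\1/"_g' \
        -e 's_"\(http://[^"]*/\)"_"\1index.html"_g'
}

printf '%s\n' \
    'href="http://www.example.com"' \
    'href="http://www.example.com/something"' \
    'href="http://www.example.com/page.html"' | add_index_dirs
# Prints:
# href="http://www.example.com/index.html"
# href="http://www.example.com/something/index.html"
# href="http://www.example.com/page.html"
```

Folders whose names contain a dot would be mistaken for files under this rule, so it needs adjusting to your own site layout.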
In general this is an almost unsolvable problem. If your html is "reasonably well behaved", the following expression searches for things that "look a lot like a URL"; you can see it at work at http://regex101.com/r/bZ9mR8 (this shows the search and replace for several examples; it should work for most others)
((?:(?:https?|ftp):\/{2})(?:(?:[0-9a-z_@-]+\.)+(?:[0-9a-z]){2,4})?(?:(?:\/(?:[~0-9a-z\#\+\%\@\.\/_-]+))?\/)*(?=\s|\"))(\/)?(index\.html?)?
The result of the above match should be replaced with
\1/index.html
Unfortunately this requires regex wizardry that is well beyond the rather pedestrian capabilities of sed, so you will have to unleash the power of perl, as follows:
perl -p -e 's/((?:(?:https?|ftp):\/{2})(?:(?:[0-9a-z_@-]+\.)+(?:[0-9a-z]){2,4})?(?:(?:\/(?:[~0-9a-z\#\+\%\@\.\/_-]+))?\/)*(?=\s|\"))(\/)?(index\.html?)?/$1\/index.html/gi'
It looks a bit daunting, I know. But it works. The only problem: if a link already ends in /, it will add /index.html, leaving a double slash. You could easily take the output of the above and process it with
sed 's/\/\/index.html/\/index.html/g'
to replace a double slash before index.html with a single slash.
Some examples (several more given in the link above)
http://www.index.com/ add /index.html
http://ex.com/a/b/" add /index.html
http://www.example.com add /index.html
http://www.example.com/something do nothing
http://www.example.com/something/ add /index.html
http://www.example.com/something/index.html do nothing