How can I append 'index.html' to every link in an HTML file that does not already end with it?

So that, for example, href="http://mysite/" would become href="http://mysite/index.html".

Jotne
Simhor
  • Do you expect there to be a `"` at the end of every link? – Floris Jan 14 '14 at 03:15
  • No, I just expect it to be a link, which is what I guess makes for a complicated regex. – Simhor Jan 14 '14 at 03:17
  • But you would expect the link to end in `/` ? I'm looking for anything that can be used as the condition to say "this is the link, and there was no '/index.html' ". Usually regex / sed is really the wrong tool for this kind of thing unless you have a particularly well formed file; the "general case" needs HTML parsing libraries (things like BeautifulSoup in Python work great). – Floris Jan 14 '14 at 03:43
  • Well yes then, to start with, we can assume that the link ends with /, and then I'll see if I have any broken links. – Simhor Jan 14 '14 at 03:45
  • A mandatory link to [The Answer](http://stackoverflow.com/a/1732454/45249). – mouviciel Jan 14 '14 at 08:02

4 Answers

I am not a sed expert, but think this works:

sed -e "s_\"\(http://[^\"]*\)/index.html\"_\"\1\"_g" \
    -e "s_\"\(http://[^\"]*[^/]\)/*\"_\"\1/index.html\"_g"

The first replacement finds URLs already ending in /index.html and deletes this ending.

The second replacement adds the /index.html as required. It deals with cases that end in / and also those that don't.

More than one version of sed exists, and they differ in details. I'm using the one that ships with Xcode on OS X.
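
For anyone who wants to experiment with the same two-pass idea away from sed, here is a minimal Python sketch. The function name `add_index` and the pattern names are made up for illustration, and the second pattern is slightly stricter than the sed one (it refuses to treat a quote as the last URL character), so take it as an approximation rather than an exact port.

```python
import re

# Pass 1: strip an existing /index.html so every quoted URL has one canonical form.
STRIP = re.compile(r'"(http://[^"]*)/index\.html"')
# Pass 2: drop any trailing slashes and append /index.html.
# Assumption: [^/"] is used instead of sed's [^/] so a quote is never captured.
APPEND = re.compile(r'"(http://[^"]*[^/"])/*"')


def add_index(text: str) -> str:
    """Apply both passes to every double-quoted http:// URL in text."""
    text = STRIP.sub(r'"\1"', text)
    return APPEND.sub(r'"\1/index.html"', text)


if __name__ == "__main__":
    print(add_index('<a href="http://mysite/">home</a>'))
```

Like the sed version, this appends to every quoted http:// URL, so a link such as "http://mysite/page.html" also gets /index.html tacked on. Whether that matters depends on the links in your file.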

Gene
What about this:

echo 'href="http://mysite/"' | awk '/http/ {sub(/\/\"/,"/index.html\"")}1'
href="http://mysite/index.html"

echo 'href="http://www.google.com/"' | awk '/http/ {sub(/\/\"/,"/index.html\"")}1'
href="http://www.google.com/index.html"
Jotne
For an href that ends with /:

sed 's|\(href="http://[^"]*/\)"|\1index.html"|g' YourFile

If there are folder references without a trailing /, you need to specify what counts as a file and what does not (for example, treat a last path component containing a dot as a file).
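
The dot heuristic mentioned above can be sketched in Python as follows. This is only one possible rule, not something the question specifies: the helper name `with_index` is hypothetical, and the assumption is that a last path component containing a dot names a file while anything else names a folder.

```python
from urllib.parse import urlsplit, urlunsplit


def with_index(url: str, index: str = "index.html") -> str:
    """Append index to url unless its last path component looks like a file."""
    parts = urlsplit(url)
    path = parts.path
    last = path.rsplit("/", 1)[-1]  # text after the final slash, may be empty
    if "." in last:
        return url  # assumed to be a file such as page.html: leave untouched
    if not path.endswith("/"):
        path += "/"  # folder reference without a trailing slash
    return urlunsplit(parts._replace(path=path + index))


if __name__ == "__main__":
    for u in ("http://mysite/", "http://mysite/docs", "http://mysite/a/page.html"):
        print(u, "->", with_index(u))
```

A folder whose name contains a dot (say, v1.2) would be misread as a file by this rule, so check the results against your real links.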

NeronLeVelu
In general this is an almost unsolvable problem. If your html is "reasonably well behaved", the following expression searches for things that "look a lot like a URL"; you can see it at work at http://regex101.com/r/bZ9mR8 (this shows the search and replace for several examples; it should work for most others)

((?:(?:https?|ftp):\/{2})(?:(?:[0-9a-z_@-]+\.)+(?:[0-9a-z]){2,4})?(?:(?:\/(?:[~0-9a-z\#\+\%\@\.\/_-]+))?\/)*(?=\s|\"))(\/)?(index\.html?)?

The result of the above match should be replaced with

\1index.html

Unfortunately this requires regex wizardry that is well beyond the rather pedestrian capabilities of sed, so you will have to unleash the power of perl, as follows:

perl -p -e 's/((?:(?:https?|ftp):\/{2})(?:(?:[0-9a-z_@-]+\.)+(?:[0-9a-z]){2,4})?(?:(?:\/(?:[~0-9a-z\#\+\%\@\.\/_-]+))?\/)*(?=\s|\"))(\/)?(index\.html?)?/\1\/index.html/gi'

It looks a bit daunting, I know, but it works. The only problem: if a link already ends in /, adding /index.html leaves a double slash (//index.html). You can easily take the output of the above and process it with

sed 's/\/\/index.html/\/index.html/g'

This replaces the double slash before index.html with a single slash.

Some examples (several more given in the link above)

http://www.index.com/                        add /index.html
http://ex.com/a/b/"                          add /index.html
http://www.example.com                       add /index.html
http://www.example.com/something             do nothing
http://www.example.com/something/            add /index.html 
http://www.example.com/something/index.html  do nothing
Floris