
I have a 200-page site and would like to add canonical links to it.

I use my FTP client to download the site into a local directory, and I would like the canonical link tag to appear right under the <head> tag on each page.

So, for page 1, I would like to transform

<head>

into

<head>
<link rel="canonical" href="http://www.site.com/page1.htm" />

and use sed to do this for every page in the local directory (page1.htm, page2.htm, ... page200.htm). Thank you.

Sergiof4
  • I am afraid that was a bad filename example: the pages have all kinds of different names, totally random. The canonical link tag should match www.site.com/*, not just page1, page2, etc. – Sergiof4 Nov 05 '13 at 19:56

2 Answers


sed and awk are not designed to process HTML. See "RegEx match open tags except XHTML self-contained tags".

Demo using bash, xmlstarlet and XSLT:

cd /where/HTML_pages/exists
for file in *html; do
    # The heredoc delimiter is unquoted, so the shell expands $file inside the stylesheet.
    xmlstarlet transform --html <(cat <<EOF
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output method="html" encoding="utf-8"/>
    <!-- identity template: copy everything that already exists -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()" />
        </xsl:copy>
    </xsl:template>
    <xsl:template match="head">
        <xsl:copy>
            <xsl:apply-templates/>
            <!-- add the canonical link only if the page has none yet -->
            <xsl:if test="not(link[@rel='canonical'])">
                <link rel="canonical" href="http://www.site.com/$file" />
            </xsl:if>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>
EOF
    ) "$file" > "/tmp/$file" && mv "/tmp/$file" "$file"
done

Edit

An even better, pure XSLT solution. It still uses xmlstarlet, but bash is no longer mandatory:

File xsl.xslt:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
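   <xsl:output method="html" encoding="utf-8" />
   <!-- the canonical URL, passed on the command line with -s link=... -->
   <xsl:param name="link" />
   <!-- we are not building the HTML from scratch,
         so we copy what already exists -->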
   <xsl:template match="@*|node()">
      <xsl:copy>
         <xsl:apply-templates select="@*|node()" />
      </xsl:copy>
   </xsl:template>
   <!-- looking for "head" tag -->
   <xsl:template match="head">
      <xsl:copy>
         <xsl:apply-templates />
         <!-- if no canonical "link" tag exists yet ... -->
         <xsl:if test="not(link[@rel='canonical'])">
            <!-- we add the new "link" tag... -->
            <link>
               <xsl:attribute name="rel">
                  <!-- with a fixed string attribute... -->
                  <xsl:text>canonical</xsl:text>
               </xsl:attribute>
               <xsl:attribute name="href">
                  <!-- and a dynamic string attribute ("link" parameter) -->
                  <xsl:value-of select="$link" />
               </xsl:attribute>
            </link>
         </xsl:if>
      </xsl:copy>
   </xsl:template>
</xsl:stylesheet>

Code:

cd /where/HTML_pages/exists
for file in *html; do
    xmlstarlet transform \
        --html \
        xsl.xslt \
        -s "link=http://www.site.com/$file" "$file" > "/tmp/$file" &&
            mv "/tmp/$file" "$file"
done

That will add the element you want inside <head>, with the current page's file name passed in as a variable.
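One thing to keep in mind with the loops above: because of the `&&`, the `mv` only runs when xmlstarlet exits with status 0, so a page can be left untouched without much notice. Here is a minimal sketch of a helper that makes this explicit. The function name `rewrite_in_place` is hypothetical (it is not part of xmlstarlet), and it assumes bash and `mktemp` are available. It appends the file name as the last argument of the command, captures the output in a temporary file, and overwrites the page only on success.

```shell
#!/usr/bin/env bash
# rewrite_in_place FILE CMD [ARGS...]   (hypothetical helper, for illustration)
# Runs "CMD ARGS... FILE", captures stdout in a temporary file, and
# overwrites FILE with that output only if CMD exits with status 0.
rewrite_in_place() {
    local file=$1 tmp status
    shift
    tmp=$(mktemp) || return 1
    if "$@" "$file" > "$tmp"; then
        # cat into the original keeps its permissions (mktemp files are 0600)
        cat -- "$tmp" > "$file"
        status=$?
    else
        status=$?
        echo "skipped $file (exit status $status)" >&2
    fi
    rm -f -- "$tmp"
    return "$status"
}

# Possible usage with the stylesheet above:
# for file in *html; do
#     rewrite_in_place "$file" xmlstarlet transform --html xsl.xslt \
#         -s "link=http://www.site.com/$file"
# done
```

Any page reported as skipped keeps its original content, so you can inspect the messages printed for that file before deciding what to do with it.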

Gilles Quénot
  • Added full OP expected solution with dynamic filenames and in place editing. – Gilles Quénot Nov 05 '13 at 21:12
  • I am trying to use the second, purest solution (I won't parse HTML with regex, thanks for the advice), but it doesn't seem to work. I saved the `xsl.xslt` file in the same directory where the HTML files are stored and changed the first line `for file in *html; do` into `for file in *htm; do`. I saw the shell going through the pages, but they were not modified. – Sergiof4 Nov 06 '13 at 09:52
  • Are there any .html files in your directory? Is the XSLT file named exactly as in the code? Are there any errors? – Gilles Quénot Nov 06 '13 at 10:51
  • In my directory there are many files with the .htm extension. There are NO files with the .html extension. Yes, the file is named exactly as in the code. – Sergiof4 Nov 06 '13 at 13:16
  • I asked whether you get any errors. Is `xmlstarlet` installed? Try it on just one file, like this: `xmlstarlet transform --html xsl.xslt -s "link=http://www.site.com/page1.htm" page1.htm` – Gilles Quénot Nov 06 '13 at 13:23
  • I tried with a very basic page and the example in your last comment worked (I had the right output in my shell, but it didn't transform the page itself). Please let me copy the output with the errors I get when I try to accomplish the main task: ~/Desktop/c $ `for file in *htm; do xmlstarlet transform --html xsl.xslt -s "link=http://www.site.com/$file" "$file" > "/tmp/$file" && mv "/tmp/$file" "$file"; done` 410.htm:156.18: ID block already defined
    ^ accedere.htm:21.9: Tag article invalid

    – Sergiof4 Nov 07 '13 at 16:39
  • Those look like warnings only. When you redirect output with `> file`, the output is written to that file instead of the terminal. Just basics. – Gilles Quénot Nov 07 '13 at 18:38

I solved the problem for myself in 2 stages:

  1. find ./ -name '*.html' | while IFS= read -r i; do echo "$i"; sed -i 's#</head>#<link rel="canonical" href="'"$i"'" />\n</head>#I' "$i"; done

  2. find ./ -name '*.html' | while IFS= read -r i; do echo "$i"; sed -i 's#<link rel="canonical" href="\.#<link rel="canonical" href="http://domainname.here#g' "$i"; done
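Note that stage 1 adds a tag every time it runs, so running it twice leaves duplicate tags. As a rough way to check the result, or to see which pages still need the tag before running anything, something like the following sketch could help. The function name `list_missing_canonical` is made up for this example, and it assumes bash plus a find that supports `-print0` (GNU and BSD find both do).

```shell
#!/usr/bin/env bash
# list_missing_canonical [DIR]   (hypothetical helper, for illustration)
# Prints every .htm/.html file under DIR (default: current directory)
# that does not contain rel="canonical" anywhere, case-insensitively.
list_missing_canonical() {
    local dir=${1:-.} f
    find "$dir" -type f \( -name '*.htm' -o -name '*.html' \) -print0 |
        while IFS= read -r -d '' f; do
            # grep -q is silent; the file name is printed only when no match is found
            grep -qi 'rel="canonical"' -- "$f" || printf '%s\n' "$f"
        done
}

# Possible usage:
# list_missing_canonical /where/HTML_pages/exists
```

This is only a plain text search, so it will also count a commented-out tag as present.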