Despite a lot of googling, I can't find out how to do this (and I'm very new to XSLT). I'm trying to take a flat sitemap.xml file and nest URLs as children wherever the URL path dictates.

Sample sitemap

<sitemap>
<url>
    <loc>https://example.com/</loc>
    <lastmod>2022-02-28</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
</url>
<url>
    <loc>https://example.com/athletics/index.html</loc>
    <lastmod>2022-02-28</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
</url>
<url>
    <loc>https://example.com/athletics/colleges-schools.html</loc>
    <lastmod>2022-02-28</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
</url>
<url>
    <loc>https://example.com/training/index.html</loc>
    <lastmod>2022-02-28</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
</url>
<urls>....</urls>
</sitemap>

Ultimately, I want to end up with either something like this

<sitemap>
<url>
    <loc>https://example.com/</loc>
    <lastmod>2022-02-28</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
    <children>
        <url>
            <loc>https://example.com/athletics/index.html</loc>
            <lastmod>2022-02-28</lastmod>
            <changefreq>monthly</changefreq>
            <priority>0.5</priority>
            <children>
                <url>
                    <loc>https://example.com/athletics/colleges-schools.html</loc>
                    <lastmod>2022-02-28</lastmod>
                    <changefreq>monthly</changefreq>
                    <priority>0.5</priority>
                </url>
            </children>
        </url>
        <url>
            <loc>https://example.com/training/index.html</loc>
            <lastmod>2022-02-28</lastmod>
            <changefreq>monthly</changefreq>
            <priority>0.5</priority>
        </url>
    </children>
</url>
</sitemap>

or the same structure converted to JSON. I have already learned how to convert XML to JSON quite easily, so I really just need to understand how I can apply a regex (or something similar) to the <loc> field and generate children based on /XX_name_XX/

Bobby
  • In all your questions regarding XSLT, please state which version of XSLT your processor supports. If you're not sure, see here how to find out: https://stackoverflow.com/a/25245033/3016153 – michael.hor257k May 20 '22 at 16:20
  • The logic of your result is not clear: why is `https://example.com/training/index.html` not a child of `https://example.com/`? – michael.hor257k May 20 '22 at 16:25
  • you're right @michael.hor257k, it should be a child. I'll update – Bobby May 20 '22 at 17:21
  • re: version of XSLT: I'm not really sure I know that. I'm ultimately going to convert it to a json object and will do all this work locally on my machine. @michael.hor257k – Bobby May 20 '22 at 17:29
  • This is not exactly a trivial task and knowing which version of XSLT you can use is rather crucial. For example, you mentioned regex: I am not entirely sure regex is required here, but certainly there is no support for regex in XSLT 1.0. – michael.hor257k May 20 '22 at 17:46
  • I also think the logic needs to be better defined: I don't see why `https://example.com/athletics/colleges-schools.html` is a child of `https://example.com/athletics/index.html`. The way I see it they are both children of `https://example.com/athletics/`. Perhaps even more importantly, there is no `url` with a `loc` of `https://example.com/athletics/` - which means there is no simple way to move recursively from one level of hierarchy to the next. All the more reason to know which XSLT version can be used here. – michael.hor257k May 20 '22 at 17:46
  • @michael.hor257k Not knowing enough about XSLT and the best versions, which would you recommend? I'm not sure I know the limitations of implementing a lower version. Correct, there is no `url` with the location of `https://example.com/athletics/` as most of these urls at the root are `/index.html`. There are _some_ urls that end with `/`, but these are few with this xml file. – Bobby May 20 '22 at 18:13
  • I recommend you use a processor that supports the latest version, of course. I am rather perplexed by your question: you are asking about XSLT code; how do you intend to run it, once you have it? – michael.hor257k May 20 '22 at 18:32

1 Answer

Searching for "xslt recursive grouping" may give you some ideas.

This is definitely going to be a lot easier with XSLT 2.0 or 3.0, which have built-in support for regex processing, tokenization, and grouping.

Having said that, it's still going to be challenging, and isn't a task I would recommend for your first foray into XSLT programming. But it's not very useful to say "If I were you, I wouldn't start from here".

The general approach is that you want to group URLs that are the same in the first N tokens (where tokens derive from splitting the URL at a slash); and within each such group, you want to divide it into subgroups that are the same in the first N+1 tokens by means of a recursive call. So it looks something like this:

<xsl:function name="f:grouping-key" as="xs:string">
  <xsl:param name="url" as="xs:string"/>
  <xsl:param name="level" as="xs:integer"/>
  <xsl:sequence select="tokenize($url, '/') 
                        => subsequence(1, $level)
                        => string-join('/')"/>
</xsl:function>

<xsl:function name="f:group-urls">
  <xsl:param name="urls" as="element(url)*"/>
  <xsl:param name="level" as="xs:integer"/>
  <xsl:for-each-group select="$urls"
                      group-by="f:grouping-key(loc, $level)">
    <xsl:choose>
      <xsl:when test="count(current-group()) = 1">
        <xsl:copy-of select="current-group()"/>
      </xsl:when>
      <xsl:otherwise>
        <url>
           <loc>{current-grouping-key()}</loc>
           <xsl:copy-of select="lastmod, changefreq, priority"/>
           <children>
             <xsl:sequence select="f:group-urls(current-group(), $level + 1)"/>
           </children>
        </url>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:for-each-group>
</xsl:function>

and then you fire it off with a call to f:group-urls(//url, 1).
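
To actually run the two functions they need a surrounding stylesheet, so here is a minimal sketch of one, assuming an XSLT 3.0 processor such as Saxon. The namespace URI bound to the f: prefix is a made-up placeholder, and wrapping the result in a <sitemap> element is my reading of the desired output rather than something stated above. The {current-grouping-key()} form only works as a text value template when expand-text="yes" is in scope, which is why it is set on the root element. Like the code above, this is not tested.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="urn:example:sitemap-functions"
    exclude-result-prefixes="xs f"
    expand-text="yes">
  <!-- xmlns:f is a hypothetical URI; any non-null namespace will do -->

  <xsl:output method="xml" indent="yes"/>

  <!-- Returns the first $level slash-separated tokens of a URL, rejoined with '/' -->
  <xsl:function name="f:grouping-key" as="xs:string">
    <xsl:param name="url" as="xs:string"/>
    <xsl:param name="level" as="xs:integer"/>
    <xsl:sequence select="tokenize($url, '/')
                          => subsequence(1, $level)
                          => string-join('/')"/>
  </xsl:function>

  <!-- Groups urls on their first $level tokens, recursing one level deeper
       for every group that holds more than one url -->
  <xsl:function name="f:group-urls">
    <xsl:param name="urls" as="element(url)*"/>
    <xsl:param name="level" as="xs:integer"/>
    <xsl:for-each-group select="$urls"
                        group-by="f:grouping-key(loc, $level)">
      <xsl:choose>
        <xsl:when test="count(current-group()) = 1">
          <xsl:copy-of select="current-group()"/>
        </xsl:when>
        <xsl:otherwise>
          <url>
            <loc>{current-grouping-key()}</loc>
            <!-- metadata is taken from the first url in the group -->
            <xsl:copy-of select="lastmod, changefreq, priority"/>
            <children>
              <xsl:sequence select="f:group-urls(current-group(), $level + 1)"/>
            </children>
          </url>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:for-each-group>
  </xsl:function>

  <!-- Entry point: rebuild the sitemap from the flat list of url elements -->
  <xsl:template match="/">
    <sitemap>
      <xsl:sequence select="f:group-urls(//url, 1)"/>
    </sitemap>
  </xsl:template>

</xsl:stylesheet>
```

One thing to watch for: tokenize('https://example.com/athletics/index.html', '/') yields 'https:', '', 'example.com', 'athletics', 'index.html', so the first three levels each put every URL on the same host into a single group. Starting the recursion at a higher level (3, say) may give a tidier tree for data like the sample, but that is a guess on my part. Also note that two url elements with an identical loc would keep landing in the same group at every level, so real data may need de-duplicating first.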

Not tested.

Michael Kay