
I want to extract around 20 element types from some SVG documents to form a new SVG. The whitelist is basically a set of visual parts: rect, circle, polygon, text, polyline and so on. JavaScript, comments, animations and external links need to go.

Three methods come to mind:

  1. Regex: I'm completely familiar with it, and would obviously rather not go there.
  2. PHP DOM: Used once perhaps a year ago.
  3. XSLT: Took my first look just now.

If XSLT is the right tool for the job, what xsl:stylesheet do I need? Otherwise, which approach would you use?

Example input:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- Created with Inkscape (http://www.inkscape.org/) -->
<svg xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:cc="http://creativecommons.org/ns#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:svg="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://www.w3.org/2000/svg" version="1.1" width="512" height="512" id="svg2">
<title>Mostly harmless</title>
  <metadata id="metadata7">Some metadata</metadata>

<script type="text/ecmascript">
<![CDATA[
alert('Hax!');
]]>
</script>
<style type="text/css">
<![CDATA[ svg{display:none} ]]>
</style>

  <defs id="defs4">
    <circle id="my_circle" cx="100" cy="50" r="40" fill="red"/> 
  </defs>

  <g id="layer1">
  <a xlink:href="www.hax.ru">
    <use xlink:href="#my_circle" x="20" y="20"/>
    <use xlink:href="#my_circle" x="100" y="50"/>
  </a>
  </g>
  <text>
    <tspan>It was the best of times</tspan>
    <tspan dx="-140" dy="15">It was the worst of times.</tspan>
  </text>
</svg>

Example output. Displays exactly the same image:

<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="512" height="512">
  <defs>
    <circle id="my_circle" cx="100" cy="50" r="40" fill="red"/> 
  </defs>
  <g id="layer1">
    <use xlink:href="#my_circle" x="20" y="20"/>
    <use xlink:href="#my_circle" x="100" y="50"/>
  </g>
  <text>
    <tspan>It was the best of times</tspan>
    <tspan dx="-140" dy="15">It was the worst of times.</tspan>
  </text>
</svg>

The approximate list of keeper elements is: g, rect, circle, ellipse, line, polyline, polygon, path, text, tspan, tref, textPath, linearGradient+stop, radialGradient, defs, clipPath.

If not specifically SVG tiny, then certainly SVG lite.

Volker E.
SamG
  • XSLT is likely the right tool for the job. If you can provide a brief sample and describe what you want to redact or keep, you would likely get an answer with an XSLT to get you started. – Mads Hansen Feb 03 '11 at 00:48
  • Good question, +1. See my answer for a complete solution that produces exactly the wanted output and for an extensive explanation. :) – Dimitre Novatchev Feb 03 '11 at 05:38
  • Extra context: If the SVG was like a forum page, you'd naturally only allow people a small subset of HTML, otherwise all sorts of scripting and vandalism would get through. This is a shared SVG document, which is conceptually just like a web forum. – SamG Feb 03 '11 at 05:48

4 Answers


This transformation:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:s="http://www.w3.org/2000/svg"
 >
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="*">
  <xsl:element name="{name()}" namespace="{namespace-uri()}">
   <xsl:copy-of select="namespace::xlink"/>

   <xsl:apply-templates select="node()|@*"/>
  </xsl:element>
 </xsl:template>

 <xsl:template match="@*">
  <xsl:attribute name="{name()}"
                 namespace="{namespace-uri()}">
   <xsl:value-of select="."/>
  </xsl:attribute>
 </xsl:template>

 <xsl:template match="s:a">
  <xsl:apply-templates/>
 </xsl:template>

 <xsl:template match=
 "s:title|s:metadata|s:script|s:style|
  s:svg/@version|s:svg/@id"/>
</xsl:stylesheet>

when applied on the provided XML document:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- Created with Inkscape (http://www.inkscape.org/) -->
<svg xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:cc="http://creativecommons.org/ns#"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:svg="http://www.w3.org/2000/svg"
     xmlns:xlink="http://www.w3.org/1999/xlink"
     xmlns="http://www.w3.org/2000/svg" version="1.1"
     width="512" height="512" id="svg2">
    <title>Mostly harmless</title>
    <metadata id="metadata7">Some metadata</metadata>
    <script type="text/ecmascript"><![CDATA[ alert('Hax!'); ]]></script>
    <style type="text/css"><![CDATA[ svg{display:none} ]]></style>
    <defs id="defs4">
        <circle id="my_circle" cx="100" cy="50" r="40" fill="red"/>
    </defs>
    <g id="layer1">
        <a xlink:href="www.hax.ru">
            <use xlink:href="#my_circle" x="20" y="20"/>
            <use xlink:href="#my_circle" x="100" y="50"/>
        </a>
    </g>
    <text>
        <tspan>It was the best of times</tspan>
        <tspan dx="-140" dy="15">It was the worst of times.</tspan>
    </text>
</svg>

produces the wanted, correct result:

<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="512" height="512">
   <defs id="defs4">
      <circle id="my_circle" cx="100" cy="50" r="40" fill="red"/>
   </defs>
   <g id="layer1">
      <use xlink:href="#my_circle" x="20" y="20"/>
      <use xlink:href="#my_circle" x="100" y="50"/>
   </g>
   <text>
      <tspan>It was the best of times</tspan>
      <tspan dx="-140" dy="15">It was the worst of times.</tspan>
   </text>
</svg>

Explanation:

  1. Two templates, whose combined effect is similar to the identity rule, match all "white-listed" nodes and essentially copy them (only eliminating unwanted namespace nodes).

  2. A template with no body matches all "black-listed" nodes (elements and some attributes). These are effectively deleted.

  3. There must be templates that match specific "grey-listed" nodes (the template matching s:a in our case). A "grey-listed" node will not be deleted completely -- it may be renamed or otherwise modified, or at least its contents may still be included in the output.

  4. It is likely that, as your understanding of the problem becomes clearer, the three lists will continue to grow, so the match pattern of the black-list deleting template will have to be modified to accommodate newly discovered black-listed elements. Newly discovered white-listed nodes require no work at all. Only new grey-listed elements (if any are found at all) will require a little more work.
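
As a hedged illustration of point 4, here is a sketch of the same stylesheet with a longer black list. The extra entries (the animation elements, foreignObject, attributes whose name starts with "on", and xlink:href values that do not point to a local fragment) are assumptions about what should go, not part of the transformation above; adjust them to your own threat model.

```xml
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:s="http://www.w3.org/2000/svg"
 xmlns:xlink="http://www.w3.org/1999/xlink">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <!-- White list (default): rebuild the element, keeping only the xlink namespace node -->
 <xsl:template match="*">
  <xsl:element name="{name()}" namespace="{namespace-uri()}">
   <xsl:copy-of select="namespace::xlink"/>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:element>
 </xsl:template>

 <!-- White list (default): re-create every attribute that is not black-listed below -->
 <xsl:template match="@*">
  <xsl:attribute name="{name()}" namespace="{namespace-uri()}">
   <xsl:value-of select="."/>
  </xsl:attribute>
 </xsl:template>

 <!-- Grey list: drop the link wrapper but keep processing its children -->
 <xsl:template match="s:a">
  <xsl:apply-templates/>
 </xsl:template>

 <!-- Black list: the original entries plus some assumed additions.
      An empty template deletes the matched node and everything inside it. -->
 <xsl:template match=
 "s:title|s:metadata|s:script|s:style|
  s:svg/@version|s:svg/@id|
  s:animate|s:animateMotion|s:animateTransform|s:animateColor|s:set|
  s:foreignObject|
  @*[starts-with(local-name(), 'on')]|
  @xlink:href[not(starts-with(., '#'))]"/>
</xsl:stylesheet>
```

The patterns with a predicate or a parent step have a higher default priority than the generic `*` and `@*` templates, so the deleting template should win for those nodes without an explicit priority attribute. Note that this is still a black list: anything not named here is copied.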

Dimitre Novatchev
  • @SamG: No the solution *is* sound, I simply hadn't seen the `a` element. Fixed now -- do have a look, and we now have one example of processing a grey-listed element. – Dimitre Novatchev Feb 03 '11 at 06:09
  • I won't argue because I'm more interested in learning xsl. Though consider the HTML policy of this very site. "we do not allow all HTML tags, as that would be an XSS paradise" SVGs aren't normally such public documents, but this one just happens to be. So the threat model is the same, and that has nothing to do with pessimism. http://meta.stackexchange.com/questions/1777/what-html-tags-are-allowed-on-stack-overflow-server-fault-and-super-user – SamG Feb 03 '11 at 07:03
  • @SamG: Leaving aside the philosophical discussions, this solution not only completely solves your problem but is a solid methodology for building any such cleaner apps -- why not consider accepting this answer? – Dimitre Novatchev Feb 03 '11 at 13:31
  • @Dimitre In most cases, I think your approach is the most clean and efficient. I think the issue is that he may not know ahead of time what nodes he wants to redact, because it is user generated content. He only knows what content he wants to keep("whitelist"). – Mads Hansen Feb 03 '11 at 16:04
  • @Mads-Hansen: I agree with your comment that the OP doesn't know exactly all the node-types he wants in/out/modified. This is why I included point 4. of the Explanation. The black-list and grey-list will grow in time, reflecting the increasing understanding on this subject. These, together with the white-list completely cover all nodes. There is no need to specify all the three lists -- one of them can be treated with the default action and this gives us convenience. – Dimitre Novatchev Feb 03 '11 at 16:13

Dimitre Novatchev's solution is more "clean" and elegant, but if you need a "whitelist" solution (because you can't predict what content users may input that you would need to "blacklist"), then you would need to fully flesh out the "whitelist".

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
    xmlns:svg="http://www.w3.org/2000/svg">
    <xsl:output indent="yes" />

    <!--The "whitelist" template that will copy matched nodes forward and apply-templates
        for any attributes or child nodes -->
    <xsl:template match="svg:svg 
        | svg:defs  | svg:defs/text()
        | svg:g  | svg:g/text()
        | svg:a  | svg:a/text()
        | svg:use   | svg:use/text()
        | svg:rect  | svg:rect/text()
        | svg:circle  | svg:circle/text()
        | svg:ellipse  | svg:ellipse/text()
        | svg:line  | svg:line/text()
        | svg:polyline  | svg:polyline/text()
        | svg:polygon  | svg:polygon/text()
        | svg:path  | svg:path/text()
        | svg:text  | svg:text/text()
        | svg:tspan  | svg:tspan/text()
        | svg:tref  | svg:tref/text()
        | svg:textPath  | svg:textPath/text()
        | svg:linearGradient  | svg:linearGradient/text()
        | svg:radialGradient  | svg:radialGradient/text()
        | svg:stop  | svg:stop/text()
        | svg:clipPath  | svg:clipPath/text()">
        <xsl:copy>
            <xsl:copy-of select="@*" />
            <xsl:apply-templates select="node()" />
        </xsl:copy>
    </xsl:template>

    <!--The "blacklist" template, which does nothing except apply templates for the 
        matched node's attributes and child nodes -->
    <xsl:template match="@* | node()">
        <xsl:apply-templates select="@* | node()" />
    </xsl:template>

</xsl:stylesheet>
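
Because `xsl:copy-of select="@*"` copies every attribute of a whitelisted element, event-handler attributes such as `onload`, inline `style`, and external `xlink:href` values would pass through unchanged. A minimal sketch of whitelisting attributes as well is shown below. The particular element and attribute lists are illustrative assumptions, not a complete or vetted SVG policy, and rebuilding elements with `xsl:element` instead of `xsl:copy` is a design choice made here so that unused namespace declarations (dc, cc, rdf) are not carried over.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:svg="http://www.w3.org/2000/svg"
    xmlns:xlink="http://www.w3.org/1999/xlink">
    <xsl:output indent="yes"/>

    <!-- Whitelisted elements (assumed list): rebuild the element in the SVG
         namespace, then let the attribute and child templates decide what survives -->
    <xsl:template match="svg:svg | svg:defs | svg:g | svg:use
        | svg:rect | svg:circle | svg:ellipse | svg:line
        | svg:polyline | svg:polygon | svg:path
        | svg:text | svg:tspan | svg:textPath
        | svg:linearGradient | svg:radialGradient | svg:stop
        | svg:clipPath">
        <xsl:element name="{local-name()}" namespace="http://www.w3.org/2000/svg">
            <!-- keep the xlink declaration in scope for use/@xlink:href -->
            <xsl:copy-of select="namespace::xlink"/>
            <xsl:apply-templates select="@*"/>
            <xsl:apply-templates select="node()"/>
        </xsl:element>
    </xsl:template>

    <!-- Whitelisted attributes (assumed list): copied as they are -->
    <xsl:template match="@id | @x | @y | @dx | @dy | @cx | @cy | @r | @rx | @ry
        | @x1 | @y1 | @x2 | @y2 | @width | @height | @viewBox
        | @d | @points | @transform | @fill | @stroke | @stroke-width
        | @offset | @stop-color">
        <xsl:copy/>
    </xsl:template>

    <!-- xlink:href is kept only on use, and only for local fragment references -->
    <xsl:template match="svg:use/@xlink:href[starts-with(., '#')]">
        <xsl:copy/>
    </xsl:template>

    <!-- Text is kept only inside text-bearing elements -->
    <xsl:template match="svg:text/text() | svg:tspan/text() | svg:textPath/text()">
        <xsl:copy/>
    </xsl:template>

    <!-- Any other element, text, comment or processing instruction is dropped,
         but its children are still examined, so whitelisted descendants survive -->
    <xsl:template match="node()">
        <xsl:apply-templates select="node()"/>
    </xsl:template>

    <!-- Any other attribute is dropped -->
    <xsl:template match="@*"/>
</xsl:stylesheet>
```

Applied to the example input, this should drop `title`, `metadata`, `script`, `style` and the `a` wrapper while keeping the two `use` elements. Unlike the wanted output in the question, it keeps the `id` on the root element, because `id` is on the assumed attribute list.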
Mads Hansen
  • It matches the attributes attached to any element bound to the SVG namespace. – Mads Hansen Feb 03 '11 at 02:32
  • I thought I understood this, but when I removed "svg:a |" from the list, I get an error. "Attribute nodes must be added before any child nodes to an element." I used xsltproc to do the actual processing. – SamG Feb 03 '11 at 03:13
  • Also I guess it needs a text() somewhere, since it's discarding all text nodes. – SamG Feb 03 '11 at 03:28
  • I've been hitting the xsl books. The blacklist part is eating ALL the text nodes, but without it, all the text is automatically copied to the new document. – SamG Feb 03 '11 at 05:27
  • Copying the attributes of an element that isn't itself copied seems questionable to me. They will either end up on the "wrong" element, or they will cause a failure because text nodes have already been written to the parent element. – Michael Kay Feb 03 '11 at 09:12
  • @SamG - I had neglected to include svg:text elements and svg:*/text(). I also added another "blacklist" template that completely eats those non-svg elements(that are in the svg namespace because a namespace prefix was not used for the svg document element) and does not process the @* or child node(), which was causing the stylesheet to emit their attribute in the wrong spot. – Mads Hansen Feb 03 '11 at 13:43
  • The changes that I had to make bring it more in line with @Dimitre Novatchev's solution, however he also suppresses some of the other attributes that you probably want to. His solution is the most "clean", as you really only have to manage the "blacklist" and everything else just works. I'd accept his answer and go with his stylesheet as the basis for your work. – Mads Hansen Feb 03 '11 at 13:48
  • @SamG - I've made another update that stays with the "whitelist" approach and seems to work. Unfortunately, it's a bit verbose, but I believe will produce the desired output and will handle additional random elements that you would want to redact. – Mads Hansen Feb 03 '11 at 14:06
  • @Mads Hansen - Many of those elements aren't allowed to contain text nodes anyway; all the basic shapes for a start. I'm confident that If I spend a few hours with the SVG specifications, your solution is a sound foundation. – SamG Feb 03 '11 at 15:57
  • @Mads Hansen: A "white list" solution can also preserve the identity rule, just use `not()` as in `` –  Feb 07 '11 at 16:27
  • This didn't work for me (it produced an empty document). I've found out that's because my svg didn't have the namespace declaration (xmlns="http://www.w3.org/2000/svg"). Removing all "svg:" namespace qualifiers on the match attribute worked. – HH321 Sep 12 '16 at 10:59
  • Do I understand it correctly that the @* outputs all attributes? There should be a whitelisting of attributes as well, otherwise stuff like (or worse) passes through – HH321 Sep 15 '16 at 15:18
  • Yes, '@*' matches any attribute and would copy all. You could filter those if you choose. – Mads Hansen Sep 15 '16 at 22:30
  • @HH321 Yeah also the following XSS does not get filtered: ` ` So how do we whitelist the attributes? – Ehsan88 Oct 16 '21 at 12:09

svgfig is a good tool for this job. You can load SVG files and pick out parts you like to make a new document. Or you can just remove parts you don't like and re-save.

Ben Jackson

As an alternative to the accepted XSLT answer, you could use Ruby and Nokogiri:

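require 'nokogiri'

# Parse the source SVG
svg = Nokogiri::XML( IO.read( "myfile.svg" ) )

# Remove every element whose name is not on the whitelist.
# Removing an element also removes its whole subtree, so container
# elements such as svg and g have to be on the list as well.
svg.xpath( '//*[not(name()="rect" or name()="circle" or ...)]' ).each do |node|
  node.remove
end

# Write the cleaned document to a new file
File.open( "myfile_clean.svg", "w" ) do |file|
  file << svg.to_xml
end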
Phrogz