
An SVG file is basically an XML file, so I could use the string <?xml (or its hex representation, '3c 3f 78 6d 6c') as a magic number. There are a few reasons not to do that, though: for example, extra leading whitespace could break the check.

The other images I need to check are all binary formats and have magic numbers. How can I quickly check whether a file is SVG without relying on the extension, preferably using Python?

yoozer8
Eduard Florinescu
  • how about reading the beginning of the file as binary - if you can't find any magic numbers, you read it as text and try to match it to your known textual patterns? – dmg Feb 28 '13 at 13:05
  • @DJV Sounds reasonable. And I don't see how it couldn't break. – Eduard Florinescu Feb 28 '13 at 13:12

3 Answers


XML is not required to start with the <?xml preamble, so testing for that prefix is not a good detection technique — not to mention that it would identify every XML as SVG. A decent detection, and really easy to implement, is to use a real XML parser to test that the file is well-formed XML that contains the svg top-level element:

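import xml.etree.ElementTree as et

def is_svg(filename):
    tag = None
    # Binary mode lets the parser handle the encoding itself, so binary
    # input fails with ParseError rather than a text decoding error.
    with open(filename, "rb") as f:
        try:
            # Stop at the first start event, which is the root element.
            for event, el in et.iterparse(f, ('start',)):
                tag = el.tag
                break
        except et.ParseError:
            pass
    return tag == '{http://www.w3.org/2000/svg}svg'

ElementTree uses the expat-based C accelerator whenever it is available, which keeps the detection efficient (the separate xml.etree.cElementTree module that originally provided this was deprecated in Python 3.3 and removed in 3.9); timeit shows that an SVG file was detected as such in ~200μs, and a non-SVG in 35μs. The iterparse API enables the parser to forgo creating the whole element tree (module name notwithstanding) and only read the initial portion of the document, regardless of total file size.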

djvg
user4815162342
  • By reading the question, the mixing of binary magic numbers and XML triggered a red alert. This answer makes clear that parsing a binary format requires one approach, and reading a (text-based) XML requires a COMPLETELY DIFFERENT approach. – heltonbiker Apr 22 '13 at 18:22
  • @heltonbiker Exactly. Magic numbers do have one thing going for them: raw performance. This is why the answer includes a code sample that demonstrates an *efficient* implementation of the proposed approach. – user4815162342 Apr 22 '13 at 19:10
  • Also, if I understand right, a binary file is inherently unstructured, just like a plain-text file. In plain text, then, we include shebangs, doctypes, and so on, while binary files need those terse, cryptic magic numbers. I believe, in that sense, that this magic-number stuff is reminiscent of the smallest-size-possible, low-level, "old way" of storing data in files, while XML and JSON, to name a few, are a more modern, human-readable, inflated and redundant way of storing data in files. So the two approaches differ in more than one aspect. – heltonbiker Apr 22 '13 at 19:15
  • From the [docs](https://docs.python.org/3.8/library/xml.etree.elementtree.html): "Changed in version 3.3: This module will use a fast implementation whenever available. The `xml.etree.cElementTree` module is deprecated." – djvg Apr 02 '21 at 16:12
  • I like this, but beware: The [XML vulnerabilities page](https://docs.python.org/3.8/library/xml.html#xml-vulnerabilities) mentions a vulnerability to [billion laughs](https://stackoverflow.com/q/3451203) and similar attacks. Testing verified that `et.iterparse()` does indeed blow up. The docs recommend [defusedxml](https://docs.python.org/3.8/library/xml.html#the-defusedxml-package). – djvg Apr 02 '21 at 19:42
  • For those wondering about the syntax of the expected `tag` value: an xml tag with svg [namespace](https://developer.mozilla.org/en-US/docs/Web/SVG/Namespaces_Crash_Course#declaring_namespaces) looks like `<svg xmlns="http://www.w3.org/2000/svg">`, and the `xml` module expands this to `{namespace}tag` as described in the [docs](https://docs.python.org/3/library/xml.etree.elementtree.html#parsing-xml-with-namespaces), so that becomes `'{http://www.w3.org/2000/svg}svg'`. – djvg Apr 06 '21 at 09:11
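
The billion-laughs concern raised in the comments above can also be approached with the standard library alone. The following is only a sketch, not a replacement for defusedxml: it drives `xml.parsers.expat` directly, aborts as soon as the root element is seen, and refuses any document that declares an entity. The names `is_svg_no_entities`, `_Stop`, and `SVG_NS` are made up for this illustration.

```python
import xml.parsers.expat as expat

SVG_NS = "http://www.w3.org/2000/svg"


class _Stop(Exception):
    """Raised from a handler to abort parsing early (hypothetical helper)."""


def is_svg_no_entities(data: bytes) -> bool:
    """Return True if the root element is <svg> in the SVG namespace.

    Documents that declare any entity are rejected outright, which avoids
    entity-expansion input at the cost of refusing some legitimate files.
    """
    root = []

    def on_start(name, attrs):
        # First start tag is the root element; record it and stop.
        root.append(name)
        raise _Stop

    def on_entity_decl(*args):
        # Entity declarations appear in the DOCTYPE, before the root element.
        raise _Stop

    # With a namespace separator, names arrive as "<uri> <localname>".
    parser = expat.ParserCreate(namespace_separator=" ")
    parser.StartElementHandler = on_start
    parser.EntityDeclHandler = on_entity_decl
    try:
        parser.Parse(data, True)
    except (_Stop, expat.ExpatError):
        pass
    return root == [SVG_NS + " svg"]
```

Note the trade-off: an SVG whose DOCTYPE declares entities for perfectly harmless reasons is reported as not SVG by this sketch, so it suits untrusted uploads better than general-purpose detection.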

You could try reading the beginning of the file as binary - if you can't find any magic numbers, you read it as a text file and match to any textual patterns you wish. Or vice-versa.
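
A minimal sketch of that idea, assuming a small illustrative signature table and UTF-8 (or ASCII) text; the function names and the table contents are not from the answer and would need extending for real use:

```python
# Illustrative signature table (an assumption for this sketch).
MAGIC_NUMBERS = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
]


def sniff_image_type(head: bytes):
    """Guess an image type from the first bytes of a file, or return None."""
    # Step 1: binary magic numbers at offset 0.
    for magic, kind in MAGIC_NUMBERS:
        if head.startswith(magic):
            return kind
    # Step 2: no signature matched, so fall back to a textual pattern.
    # Skip a UTF-8 BOM and leading whitespace, then look for an <svg tag.
    text = head
    if text.startswith(b"\xef\xbb\xbf"):
        text = text[3:]
    text = text.lstrip()
    if text.startswith(b"<") and b"<svg" in text:
        return "svg"
    return None


def sniff_file(path, size=4096):
    """Read only the first `size` bytes of `path` and sniff them."""
    with open(path, "rb") as f:
        return sniff_image_type(f.read(size))
```

The textual step is only a heuristic: an HTML page with an inline `<svg>` element near the top would also match, so a positive result may still be worth confirming with an XML parser.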

dmg

This is from man file (here), for the unix file command:

The magic tests are used to check for files with data in particular fixed formats. The canonical example of this is a binary executable ... These files have a “magic number” stored in a particular place near the beginning of the file that tells the UNIX operating system that the file is a binary executable, and which of several types thereof. The concept of a “magic” has been applied by extension to data files. Any file with some invariant identifier at a small fixed offset into the file can usually be described in this way. ...

(my emphasis)

And here's one example of the "magic" that the file command uses to identify an svg file (see source for more):

...
0       string        \<?xml\ version=
>14     regex         ['"\ \t]*[0-9.]+['"\ \t]*
>>19    search/4096   \<svg         SVG Scalable Vector Graphics image
...
0       string        \<svg         SVG Scalable Vector Graphics image
...

As described by man magic, each line follows the format <offset> <type> <test> <message>.

If I understand correctly, the code above looks for the literal "<?xml version=". If that is found, it looks for a version number, as described by the regular expression. If that is found, it searches the next 4096 bytes until it finds the literal "<svg". If any of this fails, it looks for the literal "<svg" at the start of the file, and so on.

Something similar could be implemented in Python.
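
As a rough sketch, the two rules quoted above might translate to Python as follows. This approximates libmagic's semantics rather than reproducing them (for instance, the regex is anchored at offset 14 here), and the function name is made up for the example:

```python
import re

# Version value after 'version=': optional quotes or blanks around digits and dots.
_VERSION = re.compile(rb"""['" \t]*[0-9.]+['" \t]*""")


def matches_svg_magic(data: bytes) -> bool:
    """Approximate the two 'file' magic rules for SVG shown above."""
    # Second rule: literal "<svg" at offset 0.
    if data.startswith(b"<svg"):
        return True
    # First rule: '<?xml version=' at offset 0 ...
    if not data.startswith(b"<?xml version="):
        return False
    # ... followed by a version number at offset 14 ...
    if not _VERSION.match(data, 14):
        return False
    # ... followed by "<svg" within 4096 bytes from offset 19.
    return b"<svg" in data[19:19 + 4096]
```

Because both rules test at offset 0, leading whitespace or a byte order mark makes this check fail, which is the fragility the question worries about.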

Note there's also python-magic, which provides an interface to libmagic, as used by the unix file command.

djvg
  • An XML file might start with a BOM (byte order mark). As this code seems to read `<?xml` from offset 0, a leading BOM could make the match fail. – mortb Sep 05 '22 at 09:22
  • @mortb The code above represents just one of many patterns from the [source](https://github.com/file/file/blob/f58043c39891ff165660b84bc342820106ded5b2/magic/Magdir/sgml) of the `file` command for Linux. – djvg Sep 05 '22 at 10:40
  • I understand. I think it is great that you used a ready made solution. I just made the remark in case someone gets an edge case error due to having a BOM at the beginning of the svg file. – mortb Sep 05 '22 at 15:20