18

I'm new to XML, so this may be a fairly easy question to answer. I was wondering if there is a standard way of referencing external XML files from within other XML files. Let me give an example. Say you have a file which defines a single object that holds a large amount of data:

<person>
    <name>John</name>
    <age>18</age>
    <hair>Brown</hair>
    <eyes>Blue</eyes>
</person>

For the sake of this question, pretend that person holds loads of other information. Pretend the file is like 10 MB.

Now, let's say you have another XML file which defines a group:

<group>
    <person>
        <name>John</name>
        <age>18</age>
        <hair>Brown</hair>
        <eyes>Blue</eyes>
    </person>
    <person>
        <name>Kim</name>
        <age>21</age>
        <hair>Blue</hair>
        <eyes>Green</eyes>
    </person>
    <person>
        <name>Sean</name>
        <age>22</age>
        <hair>Black</hair>
        <eyes>Brown</eyes>
    </person>
</group>

As you can see, if each person were very large, the Group file would be extremely large. So, if we have something like John.xml, is there a standard way to reference it in Group.xml without explicitly defining all of John's data? I'm sure this is a very broad topic, so feel free to link me to any relevant web pages. Thanks!

5 Answers

12

Standards

XInclude is the only standard with any level of support.

  • Several XML editors, including Oxygen and XMLSpy, support it.
  • Several XML parsers, including Xerces, also support it, and there are .NET ports too.
  • Several XML tools, such as Saxon, support it, for both Java and .NET.

There are some good examples of use in the Wikipedia article on XInclude.
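
As a rough illustration of what XInclude processing looks like, here is a minimal sketch using Python's standard-library xml.etree.ElementInclude module. The file names John.xml and Kim.xml, the in-memory FILES table, and the custom loader are assumptions made so the example runs without touching the disk; in a real setup each person would live in its own file and the default loader could be used instead.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical per-person documents, held in memory to keep the sketch self-contained.
FILES = {
    "John.xml": "<person><name>John</name><age>18</age></person>",
    "Kim.xml": "<person><name>Kim</name><age>21</age></person>",
}

# The group file only points at the person files via xi:include.
GROUP_XML = """
<group xmlns:xi="http://www.w3.org/2001/XInclude">
    <xi:include href="John.xml"/>
    <xi:include href="Kim.xml"/>
</group>
"""


def loader(href, parse, encoding=None):
    """Resolve an xi:include href against the in-memory FILES table."""
    if parse == "xml":
        return ET.fromstring(FILES[href])
    return FILES[href]  # parse="text": return the raw string


def load_group():
    """Parse the group document and expand its xi:include elements in place."""
    root = ET.fromstring(GROUP_XML)
    ElementInclude.include(root, loader=loader)
    return root


names = [person.findtext("name") for person in load_group().findall("person")]
print(names)
```

After expansion the tree contains ordinary person elements, so the in-memory result is about as large as it would be for one big file; XInclude keeps the files on disk small, not the parsed document.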

XLink is a tangentially-related standard, not really for including documents, but more for citing portions within other documents. It's not well supported.

Alternatives

If you are worried about size, there are several ways to go:

  • Use a streaming XML processor, such as DataDirect XQuery (or, to a lesser extent, Saxon 9.3 EE), which keeps only as much information in memory as is necessary to answer the query.
  • Use an XML database, such as MarkLogic or eXist.
  • Use one XML file to list the names of the other XML files; a program written in XQuery or XSLT then reads each one using the doc() function and processes it. (Unless your processor is streaming or has a way to dispose of documents it has finished with, as DDXQ and Saxon do, you will still run into the same size problem, though.)
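
The last option can be sketched outside XQuery or XSLT as well. The following is a minimal Python version of the same idea, not the doc()-based approach itself: a manifest file lists the person files, and a generator parses them one at a time so that only one person document needs to be in memory at once. The personref element, its href attribute, and the temporary files are hypothetical names invented for this sketch.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def iter_people(manifest_path):
    """Yield one parsed <person> root at a time from the files named in the manifest."""
    manifest_path = Path(manifest_path)
    manifest = ET.parse(str(manifest_path)).getroot()
    for ref in manifest.findall("personref"):
        # Each person file is parsed on demand; the previous tree can be
        # garbage-collected once the caller no longer holds a reference to it.
        yield ET.parse(str(manifest_path.parent / ref.get("href"))).getroot()


def demo():
    """Write three small person files plus a manifest, then read them back lazily."""
    people = [("John", 18), ("Kim", 21), ("Sean", 22)]
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        for name, age in people:
            (tmp / f"{name}.xml").write_text(
                f"<person><name>{name}</name><age>{age}</age></person>",
                encoding="utf-8",
            )
        refs = "".join(f'<personref href="{name}.xml"/>' for name, _ in people)
        (tmp / "Group.xml").write_text(f"<group>{refs}</group>", encoding="utf-8")
        return [
            (person.findtext("name"), int(person.findtext("age")))
            for person in iter_people(tmp / "Group.xml")
        ]


summary = demo()
print(summary)
```
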
lavinio
  • XInclude is probably the best answer to this question. However, parsing the file with XInclude processing enabled will just lead to the same (ok, almost the same) structure in memory as having a single large file. – Nic Gibson Jul 01 '09 at 07:47
  • Post-question: how well does XInclude work with referencing (ID/IDREF and ref/keyref types)? – Rekin Jul 08 '13 at 10:36
5

There are a couple of "standard" ways to do what you want, namely XLink and XInclude (depending on what you want to do), though you have to make sure that you have a processor that can pull in the external references. Most XML libraries don't come with this functionality already enabled.

Then you'd be able to do something like:

<group>
  <personlink xlink:href="person.xml" xlink:show="embed" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</group>

However, you probably don't really need this. If you need a subset of information from a large document, you can easily use XSLT or XQuery to trim out the parts that you need. You can use this approach, along with SAX parsing - which is event-based and doesn't hold the whole document in memory - to scale your application to handle fairly large documents.
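
To make the SAX idea concrete, here is a minimal sketch using Python's standard-library xml.sax module. It pulls just the names out of a group document by reacting to parser events, without building a tree. The NameCollector class is a hypothetical helper written for this example, and the input is a shortened version of the group file from the question.

```python
import xml.sax

GROUP_XML = b"""<group>
    <person><name>John</name><age>18</age></person>
    <person><name>Kim</name><age>21</age></person>
    <person><name>Sean</name><age>22</age></person>
</group>"""


class NameCollector(xml.sax.ContentHandler):
    """Collect the text of every <name> element as the parser streams past it."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_name = False
        self._buffer = []

    def startElement(self, tag, attrs):
        if tag == "name":
            self._in_name = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate until the end tag.
        if self._in_name:
            self._buffer.append(content)

    def endElement(self, tag):
        if tag == "name":
            self.names.append("".join(self._buffer))
            self._in_name = False


handler = NameCollector()
xml.sax.parseString(GROUP_XML, handler)
print(handler.names)
```

For a file on disk, xml.sax.parse(path, handler) would be used in place of parseString, and the handler's memory use stays proportional to what it chooses to keep rather than to the size of the document.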

Even while using DOM, I didn't start to see problems with large documents until they were in the tens of megabytes range.

Chris Scott
3

The XML specification covers DTDs, in which you can declare external entities and then reference them from the document body.

A simple document like:

<!DOCTYPE test [
    <!ENTITY ref SYSTEM "file:///C:/test.txt" >
]>

<test>
    &ref;
</test>

And file:///C:/test.txt being:

<blah>
Fee
Fi
Fo
Fum
</blah>

will expand the original document to:

<test>
    <blah>
    Fee
    Fi
    Fo
    Fum
    </blah>
</test>

I do believe non-validating XML parsers are not required to expand out the references, so be cautious there.

Also, don't forget to put standalone="no" in the XMLDecl. (Omitting the standalone attribute is treated as "no", but it's still better to put it there...)

DeadHead
0

Um, there are no size limitations on XML files. You shouldn't worry about extremely large sizes. But remember: XML is a data exchange format, not a database format. You use XML to swap data between different applications/services.

Makach
  • Right, but I don't want to end up with a 30 GB XML file. I simplified my problem for the purpose of the question, but if I were to include all of the "Persons" in a single file, this is what I would have. –  Jun 30 '09 at 18:12
  • Well, maybe you need to look at other data structures? I wouldn't be daunted by a 30 GB XML file. But if you must, divide and conquer. I know there are ways to "embed/link" XML files, but doing that across many small files would defeat the point of not having it in a single file, imo. – Makach Jun 30 '09 at 18:39
0

There is no standard (one that will work in every parser) for importing nodes like that. But you could save space by changing some of your elements into attributes:

<group>
  <person name='John' age='18' hair='Brown' eyes='Blue' />
  <person name='Kim' age='21' hair='Blue' eyes='Green' />
  <person name='Sean' age='22' hair='Black' eyes='Brown' />
</group>
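
To get a feel for how much this saves, here is a minimal Python sketch that rewrites each person's child elements as attributes and compares the serialized sizes. The to_attribute_form helper is hypothetical and assumes every child of person is a simple leaf with text, which would not hold for nested data.

```python
import xml.etree.ElementTree as ET

ELEMENT_FORM = """<group>
    <person><name>John</name><age>18</age><hair>Brown</hair><eyes>Blue</eyes></person>
    <person><name>Kim</name><age>21</age><hair>Blue</hair><eyes>Green</eyes></person>
    <person><name>Sean</name><age>22</age><hair>Black</hair><eyes>Brown</eyes></person>
</group>"""


def to_attribute_form(group):
    """Return a new <group> whose <person> children carry their data as attributes."""
    compact = ET.Element(group.tag)
    for person in group.findall("person"):
        # Assumption: each child is a leaf element, so tag -> text is lossless.
        attrs = {child.tag: (child.text or "") for child in person}
        ET.SubElement(compact, "person", attrs)
    return compact


original = ET.fromstring(ELEMENT_FORM)
compact = to_attribute_form(original)
size_before = len(ET.tostring(original))
size_after = len(ET.tostring(compact))
print(size_before, size_after)
```
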
Matthew Whited
  • I tried to emphasize that "Person" stores a lot of information about each person. If I did this, I would have person tags a million miles long. –  Jun 30 '09 at 18:13
  • And why do you feel that would be a problem? – John Saunders Jun 30 '09 at 18:27
  • Well, XML is supposed to be easily readable, and if there are attribute lists millions of miles long, that doesn't sound so readable... – DeadHead Jun 30 '09 at 18:44
  • That comes down to formatting. You can put each attribute on its own line if you want. But are you really going to process that large an XML document by hand? – Matthew Whited Jun 30 '09 at 18:54
  • Yeah, you've got a point there (the processing by hand). Sometimes I just don't think of things like that... – DeadHead Jun 30 '09 at 19:02
  • From the XML spec: "6. XML documents should be human-legible and reasonably clear" neither of which really requires the file to be "easily readable" - see http://www.w3.org/TR/2008/REC-xml-20081126/#sec-origin-goals – barrowc Jul 01 '09 at 02:29
  • From [W3Schools](http://www.w3schools.com/dtd/dtd_el_vs_attr.asp): Some of the problems with attributes are: 1) attributes cannot contain multiple values (child elements can) 2) attributes are not easily expandable (for future changes) 3) attributes cannot describe structures (child elements can) 4) attributes are more difficult to manipulate by program code 5) attribute values are not easy to test against a DTD – brims Jun 11 '15 at 11:14