parse multi-document RELAX-NG schema using libxml2

Question

I want to convert a RELAX NG schema to a schemaInfo object so that it can be used in CodeMirror for XML completion.

https://codemirror.net/demo/xmlcomplete.html

xmllint usage

libxml2 already supports multi-document RELAX NG schemas when validating a document, like this (note that xmllint's --schema option expects a W3C XML Schema; RELAX NG schemas are passed with --relaxng):

xmllint --relaxng myschema.rng mydoc.xml

Question

Can libxml2 also be used to parse a multi-document schema file?

Here is an example for a multi-document schema:

https://docs.oasis-open.org/office/v1.1/errata01/os/OpenDocument-strict-schema-v1.1-errata01-complete.rng

Here is some libxml2 functionality that I don't understand but that could be helpful:

http://xmlsoft.org/html/libxml-relaxng.html#xmlRelaxNGDump

Assumption

I think I have to convert the multi-document schema into a single document schema using tools like: https://github.com/h4l/rnginline/tree/master/rnginline

Using libxml2 directly would be great since I could then support schemas without pre-processing.

update 3.5.2016

As you can see, parsing the RELAX NG schema shows only the top-level file. The output does not contain any of the files that the main file pulls in through the include directive (note: RELAX NG schemas can be split into several files).

<!-- XHTML Basic -->

<grammar ns="http://www.w3.org/1999/xhtml"
         xmlns="http://relaxng.org/ns/structure/1.0">

<include href="modules/datatypes.rng"/>
<include href="modules/attribs.rng"/>
<include href="modules/struct.rng"/>
<include href="modules/text.rng"/>
<include href="modules/hypertext.rng"/>
<include href="modules/list.rng"/>
<include href="modules/basic-form.rng"/>
<include href="modules/basic-table.rng"/>
<include href="modules/image.rng"/>
<include href="modules/param.rng"/>
<include href="modules/object.rng"/>
<include href="modules/meta.rng"/>
<include href="modules/link.rng"/>
<include href="modules/base.rng"/>

</grammar>

source code

/**
 * section: Tree
 * synopsis: Navigates a tree to print element names
 * purpose: Parse a file to a tree, use xmlDocGetRootElement() to
 *          get the root element, then walk the document and print
 *          all the element name in document order.
 * usage: tree1 filename_or_URL
 * test: tree1 test2.xml > tree1.tmp && diff tree1.tmp $(srcdir)/tree1.res
 * author: Dodji Seketeli
 * copy: see Copyright for the status of this software.
 */
#include <stdio.h>
#include <stdlib.h> /* exit() in the fallback main below */
#include <libxml/parser.h>
#include <libxml/tree.h>

#ifdef LIBXML_TREE_ENABLED


#define ANSI_COLOR_RED     "\x1b[31m"
#define ANSI_COLOR_GREEN   "\x1b[32m"
#define ANSI_COLOR_YELLOW  "\x1b[33m"
#define ANSI_COLOR_BLUE    "\x1b[34m"
#define ANSI_COLOR_MAGENTA "\x1b[35m"
#define ANSI_COLOR_CYAN    "\x1b[36m"
#define ANSI_COLOR_RESET   "\x1b[0m"


/*
 *To compile this file using gcc you can type
 *gcc `xml2-config --cflags --libs` -o xmlexample libxml2-example.c
 */

/**
 * print_element_names:
 * @a_node: the initial xml node to consider.
 *
 * Prints the names of the all the xml elements
 * that are siblings or children of a given xml node.
 */

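/* Returns a string of depth+1 spaces. The buffer is static, so the
 * pointer stays valid after return but is overwritten by the next call. */
static const char *pad(int depth) {
  static char str[2000];
  if (depth < 0)
    depth = 0;
  if (depth > (int)sizeof(str) - 2)
    depth = (int)sizeof(str) - 2;
  for (int i = 0; i <= depth; i++) {
    str[i] = ' ';
  }
  str[depth + 1] = 0;
  return str;
}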

static void
print_element_names(xmlNode * a_node, int depth)
{
    xmlNode *cur_node = NULL;

    for (cur_node = a_node; cur_node; cur_node = cur_node->next) {
        if (cur_node->type == XML_ELEMENT_NODE) {
//        if (strcmp(cur_node->name, "element") == 0) {
//             printf("node type: Element, name: %s\n", cur_node->name);
            printf("%s %s\n", pad(depth), cur_node->name);
            for (xmlAttrPtr attr = cur_node->properties; NULL != attr; attr = attr->next)
            {
                printf("%s", ANSI_COLOR_MAGENTA);
                printf("%s %s: ", pad(depth), attr->name);
                xmlChar* value = xmlNodeListGetString(cur_node->doc, attr->children, 1);
                printf("%s \n", value ? (const char *)value : "");
                xmlFree(value); /* the returned string is owned by the caller */
                printf("%s", ANSI_COLOR_RESET);
            }
//   }

        }

        print_element_names(cur_node->children, depth+1);
    }
}


/**
 * Simple example to parse a file called "file.xml",
 * walk down the DOM, and print the name of the
 * xml elements nodes.
 */
int
main(int argc, char **argv)
{
    xmlDoc *doc = NULL;
    xmlNode *root_element = NULL;

    if (argc != 2)
        return(1);

    /*
     * this initialize the library and check potential ABI mismatches
     * between the version it was compiled for and the actual shared
     * library used.
     */
    LIBXML_TEST_VERSION

    /*parse the file and get the DOM */
    doc = xmlReadFile(argv[1], NULL, 0);

    if (doc == NULL) {
        printf("error: could not parse file %s\n", argv[1]);
        return 1;
    }

    /*Get the root element node */
    root_element = xmlDocGetRootElement(doc);

    print_element_names(root_element, 0);

    /*free the document */
    xmlFreeDoc(doc);

    /*
     *Free the global variables that may
     *have been allocated by the parser.
     */
    xmlCleanupParser();

    return 0;
}
#else
int main(void) {
    fprintf(stderr, "Tree support not compiled in\n");
    exit(1);
}
#endif

example usage

[nix-shell:~/Desktop/projects/nlnet/nlnet]$ ./tree1 html5-rng/xhtml-basic.rng
 grammar
  ns: http://www.w3.org/1999/xhtml 
   include
   href: modules/datatypes.rng 
   include
   href: modules/attribs.rng 
   include
   href: modules/struct.rng 
   include
   href: modules/text.rng 
   include
   href: modules/hypertext.rng 
   include
   href: modules/list.rng 
   include
   href: modules/basic-form.rng 
   include
   href: modules/basic-table.rng 
   include
   href: modules/image.rng 
   include
   href: modules/param.rng 
   include
   href: modules/object.rng 
   include
   href: modules/meta.rng 
   include
   href: modules/link.rng 
   include
   href: modules/base.rng

If `xmllint` supports multi-document schemas, `libxml2` does as well. So what have you tried? Why didn't it work? — nwellnhof, Apr 27 '16 at 12:11
As I said, I don't want to use a multi-document schema to validate an XML document; I want to parse the multi-document schema itself using libxml2. — qknight, Apr 30 '16 at 14:49
By "using libxml2 directly", do you mean "using libxml2 in a C program"? — mzjn, May 02 '16 at 16:28
@mzjn: I updated the question with source code based on the official tree1.c example, which opens an XML file, parses it and prints the result to the shell. My version adds color codes, which is basically the only difference from the upstream example. — qknight, May 03 '16 at 10:31
The question may not have attracted many people because it is very lengthy and supplies too many unnecessary details. I found a possible [solution](https://stackoverflow.com/a/72360791/213871). — ceztko, May 24 '22 at 09:58

ceztko · Answer 1 · 2022-05-26T12:28:05.643

Although the question is unnecessarily lengthy, it is clear what is being asked. As of version 2.9.14, libxml2 does not appear able to resolve includes other than by fetching a URL or looking in the filesystem, probably by searching the current directory for a file named by the href attribute. This may already answer the question, but it is insufficient if the schema has to be loaded from buffers in memory.

A clean approach would be to supply a callback that resolves the rng:include directives, but libxml2 does not seem to provide such an API. Another approach, which could actually lead to more efficient operation, is to recursively merge the outer schema and everything it includes into a single schema without include directives. The following code worked for me when merging a medium-complexity schema (8 files). Just change the paths and filenames accordingly.

#include <memory>
#include <string>
#include <string_view>
#include <stdexcept>
#include <unordered_set>
#include <vector>
#include <filesystem>

#include <libxml/parser.h>
#include <libxml/tree.h>
#include <libxml/xmlsave.h>

using namespace std;
namespace fs = std::filesystem;

using DocPtr = std::unique_ptr<xmlDoc, decltype(&xmlFreeDoc)>;

constexpr const char* SchemaBasePath = R"(D:\Schemas)";
constexpr const char* RngSchemaFilename = "Schema.rng";
constexpr const char* MergedSchemaSavePath = R"(D:\Schemas\Schema_Merged.rng)";
constexpr const char* RngNS = "rng";
constexpr const char* RngNSHref = "http://relaxng.org/ns/structure/1.0";

struct Qualifier
{
    bool IsNamespace;
    string Name;
    string Value;
};

static DocPtr readDoc(const string_view& filepath);
static void followDoc(xmlDocPtr doc, vector<xmlNodePtr>& nodes, vector<Qualifier>& qualifiers);
static void followDoc(xmlNodePtr root, vector<xmlNodePtr>& nodes, vector<Qualifier>& qualifiers);
static void removeNode(xmlNodePtr element);
static string findHRef(const xmlNodePtr element);
static string getAttributeContent(const xmlAttrPtr attr);
static void saveDocToFile(xmlDocPtr doc, const string_view& filepath);
static void addNamespaceTo(vector<Qualifier>& qualifiers, xmlNsPtr ns);
static void addAttributeTo(vector<Qualifier>& qualifiers, xmlAttrPtr attr);

unordered_set<string> s_schemas;

int main()
{
    LIBXML_TEST_VERSION;
    auto packetRngPath = fs::u8path(SchemaBasePath) / RngSchemaFilename;
    auto packetRngDoc = readDoc(packetRngPath.u8string());

    vector<xmlNodePtr> nodes;
    vector<Qualifier> qualifiers;
    followDoc(packetRngDoc.get(), nodes, qualifiers);

    auto newDoc = DocPtr(xmlNewDoc(nullptr), &xmlFreeDoc);
    auto grammarNode = xmlNewChild((xmlNodePtr)newDoc.get(), nullptr, (const xmlChar*) "grammar", nullptr);
    if (grammarNode == nullptr)
        throw runtime_error("Can't create rng:grammar node");

    auto rngNs = xmlNewNs(grammarNode, (const xmlChar*)RngNSHref, (const xmlChar*)RngNS);
    if (rngNs == nullptr)
        throw runtime_error("Can't find or create rng namespace");
    xmlSetNs(grammarNode, rngNs);

    for (auto qualifier : qualifiers)
    {
        // Recreate the gathered namespaces and attributes
        if (qualifier.IsNamespace)
        {
            xmlNewNs(grammarNode, (const xmlChar*)qualifier.Value.data(),
                (const xmlChar*)qualifier.Name.data());
        }
        else
        {
            xmlNewProp(grammarNode, (const xmlChar*)qualifier.Name.data(),
                (const xmlChar*)qualifier.Value.data());
        }
    }

    for (auto node : nodes)
    {
        if (xmlAddChild(grammarNode, node) == nullptr)
            throw runtime_error("Can't add child node to grammar");
    }

    // This actually fixes the copied namespaces
    // to share just one instance
    if (xmlReconciliateNs(newDoc.get(), grammarNode) == -1)
        throw runtime_error("Can't reconciliate namespaces");

    saveDocToFile(newDoc.get(), MergedSchemaSavePath);

    return 0;
}

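DocPtr readDoc(const string_view& filepath)
{
    // filepath must be null-terminated, which holds for the callers here
    auto doc = DocPtr(xmlReadFile(filepath.data(), nullptr,
        XML_PARSE_NOBLANKS), &xmlFreeDoc);
    if (doc == nullptr)
        throw runtime_error("Can't read XML document");
    return doc;
}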

void followDoc(xmlDocPtr doc, vector<xmlNodePtr>& nodes, vector<Qualifier>& qualifiers)
{
    auto root = xmlDocGetRootElement(doc);

    // Fetch namespaces (xmlGetNsList returns null when there are none)
    auto namespaces = xmlGetNsList(doc, root);
    for (unsigned i = 0; namespaces != nullptr && namespaces[i] != nullptr; i++)
        addNamespaceTo(qualifiers, namespaces[i]);
    xmlFree(namespaces);

    // Fetch attributes
    for (xmlAttrPtr attribute = root->properties; attribute; attribute = attribute->next)
        addAttributeTo(qualifiers, attribute);

    followDoc(root, nodes, qualifiers);
}

void followDoc(xmlNodePtr root, vector<xmlNodePtr>& nodes, vector<Qualifier>& qualifiers)
{
    for (auto child = xmlFirstElementChild(root); child; child = xmlNextElementSibling(child))
    {
        string href;
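        if (child->ns != nullptr
            && child->ns->href != nullptr
            // Match on the namespace URI: the prefix may differ or be absent (default namespace)
            && string_view((const char*)child->ns->href) == RngNSHref
            && string_view((const char*)child->name) == "include"
            && (href = findHRef(child)).length() != 0)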
        {
            if (s_schemas.find(href) == s_schemas.end())
            {
                auto schemaPath = fs::u8path(SchemaBasePath) / href;
                auto doc = readDoc(schemaPath.u8string());
                s_schemas.insert(href);
                followDoc(doc.get(), nodes, qualifiers);
            }

            continue;
        }

        auto copied = xmlCopyNode(child, 1);
        if (copied == nullptr)
            throw runtime_error("Can't copy child node");

        nodes.push_back(copied);
    }
}

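void addNamespaceTo(vector<Qualifier>& qualifiers, xmlNsPtr xmlNs)
{
    // A default namespace has a null prefix, which cannot be stored as a
    // std::string; skip it (copied nodes carry their own namespace)
    if (xmlNs->prefix == nullptr || xmlNs->href == nullptr)
        return;

    for (const auto& ns : qualifiers)
    {
        // Ensure the namespace has not yet been added first
        if (ns.IsNamespace && ns.Name == (const char*)xmlNs->prefix)
            return;
    }
    qualifiers.push_back({ true, (const char*)xmlNs->prefix, (const char*)xmlNs->href });
}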

void addAttributeTo(vector<Qualifier>& qualifiers, xmlAttrPtr xmlAttr)
{
    for (auto attr : qualifiers)
    {
        // Ensure the attribute has not yet been added first
        if (!attr.IsNamespace && attr.Name == (const char*)xmlAttr->name)
            return;
    }
    qualifiers.push_back({ false, (const char*)xmlAttr->name, getAttributeContent(xmlAttr) });
}

void removeNode(xmlNodePtr element)
{
    // Detach the element from its document and free it.
    // (Unused helper, kept for completeness.)
    xmlUnlinkNode(element);
    xmlFreeNode(element);
}

string findHRef(const xmlNodePtr element)
{
    for (xmlAttrPtr attr = element->properties; attr; attr = attr->next)
    {
        if (string_view((const char*)attr->name) == "href")
            return getAttributeContent(attr);
    }

    return { };
}

string getAttributeContent(const xmlAttrPtr attr)
{
    xmlChar* content = xmlNodeGetContent((const xmlNode*)attr);
    if (content == nullptr)
        return { };

    unique_ptr<xmlChar, decltype(xmlFree)> contentFree(content, xmlFree);
    return string((const char*)content);
}

void saveDocToFile(xmlDocPtr doc, const string_view& filepath)
{
    auto ctx = xmlSaveToFilename(filepath.data(), "utf-8", XML_SAVE_FORMAT);
    if (ctx == nullptr || xmlSaveDoc(ctx, doc) == -1 || xmlSaveClose(ctx) == -1)
        throw runtime_error("Can't save XML document");
}

score -1 · Answer 2 · edited May 23 '17 at 12:00

Can libxml2 also be used to parse a multi-document schema file?

xmllint calls the xmlRelaxNGValidateDoc method of libxml2:

xmlRelaxNGValidateDoc(xmlRelaxNGValidCtxtPtr ctxt,xmlDocPtr doc)

For example:

 #include <stdio.h>
 #include <stdlib.h>
 #include <sys/types.h>

 #include <libxml/xmlmemory.h>
 #include <libxml/parser.h>
 #include <libxml/relaxng.h>

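 int main(int argc, char *argv[])
 {
    int status;
    xmlDoc *doc;
    xmlRelaxNGPtr schema;
    xmlRelaxNGValidCtxtPtr validctxt;
    xmlRelaxNGParserCtxtPtr rngparser;

    if (argc != 3) {
        fprintf(stderr, "usage: %s document.xml schema.rng\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    doc = xmlParseFile(argv[1]);

    rngparser = xmlRelaxNGNewParserCtxt(argv[2]);
    schema = rngparser ? xmlRelaxNGParse(rngparser) : NULL;
    if (doc == NULL || schema == NULL) {
        fprintf(stderr, "could not parse document or schema\n");
        exit(EXIT_FAILURE);
    }
    validctxt = xmlRelaxNGNewValidCtxt(schema);

    /* 0 means valid, a positive value invalid, -1 an internal or API error */
    status = xmlRelaxNGValidateDoc(validctxt, doc);
    printf("status == %d\n", status);

    /* free the validation context before the schema it was created from */
    xmlRelaxNGFreeValidCtxt(validctxt);
    xmlRelaxNGFree(schema);
    xmlRelaxNGFreeParserCtxt(rngparser);
    xmlFreeDoc(doc);
    exit(EXIT_SUCCESS);
 }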

Validates the following source:

<?xml version="1.0"?>
<root>
  <t>foo</t>
</root>

with the following schema:

<?xml version="1.0" encoding="UTF-8"?>
<grammar ns="" xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="t">
      <ref name="tcont"/>
    </element>
  </start>
  <define name="tcont">
    <text/>
  </define>
</grammar>

The relevant distinction is between the externalRef element:

The externalRef pattern can be used to reference a pattern defined in a separate file. The externalRef element has a required href attribute that specifies the URL of a file containing the pattern. The externalRef matches if the pattern contained in the specified URL matches.

For example:

<?xml version="1.0" encoding="UTF-8"?>
<grammar ns="" xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="root">
      <externalRef href="595792-ext.rng"/>
    </element>
  </start>
</grammar>

versus the include element:

The include element allows grammars to be merged together. A grammar pattern may have include elements as children. An include element has a required href attribute that specifies the URL of a file containing a grammar pattern. The definitions in the referenced grammar pattern will be included in grammar pattern containing the include element.

The combine attribute is particularly useful in conjunction with include. If a grammar contains multiple definitions with the same name, then the definitions must specify how they are to be combined into a single definition by using the combine attribute.

For example:

demo.rng

<?xml version="1.0" encoding="iso-8859-1"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
 datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">

<include href="demo2.rng">
<define name="TEI.prose"><ref name="INCLUDE"/></define>
</include>
</grammar>

demo2.rng

<?xml version="1.0" encoding="utf-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0" xmlns:t="http://www.thaiopensource.com/ns/annotations" xmlns:a="http://relaxng.org/ns/compatibility/annotations/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">

   <start>
         <ref name="TEI.2"/>
   </start>
   <define name="IGNORE">
      <notAllowed/>
   </define>
   <define name="INCLUDE">
      <empty/>
   </define>


  <include href="demo3.rng"/>

   <define name="TEI.2">
      <element name="TEI.2">
         <text/>
      </element>
   </define>

</grammar>

demo3.rng

<?xml version="1.0" encoding="utf-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0" xmlns:t="http://www.thaiopensource.com/ns/annotations" xmlns:a="http://relaxng.org/ns/compatibility/annotations/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">

   <define name="TEI.prose" combine="interleave">
      <ref name="IGNORE"/>
   </define>

</grammar>

This answer doesn't really grasp what was asked for. — ceztko, May 24 '22 at 09:53
