Regex character class subtraction in C++

Question

I'm writing a C++ program that will need to take regular expressions that are defined in an XML Schema file and use them to validate XML data. The problem is, the flavor of regular expressions used by XML Schemas does not seem to be directly supported in C++.

For example, there are a couple of special character classes, \i and \c, that are not defined by default. The XML Schema regex language also supports something called "character class subtraction", which does not seem to be supported in C++ either.

Allowing the use of the \i and \c special character classes is pretty simple, I can just look for "\i" or "\c" in the regular expression and replace them with their expanded versions, but getting character class subtraction to work is a much more daunting problem...

For example, this regular expression that is valid in an XML Schema definition throws an exception in C++ saying it has unbalanced square brackets.

#include <iostream>
#include <regex>

int main()
{
    try
    {
        // Match any lowercase letter that is not a vowel
        std::regex rx("[a-z-[aeiuo]]");
    }
    catch (const std::regex_error& ex)
    {
        std::cout << ex.what() << std::endl;
    }
}

How can I get C++ to recognize character class subtraction within a regex? Or even better, is there a way to just use the XML Schema flavor of regular expressions directly within C++?

This question was closed for needing debugging details, which is bizarre. There is no code that needs debugging here; the OP is asking for how to get XML schema flavor regex support in `std::regex`. — cigien, Mar 09 '21 at 03:16

score 3 · Answer 1 · answered Mar 10 '21 at 01:21

Character class subtraction and intersection are not available in any of the grammars supported by std::regex, so you will have to rewrite the expression into a form that one of those grammars accepts.

The easiest way is to perform the subtraction yourself and pass the resulting set to std::regex, for instance [bcdfghjklmnpqrstvwxyz] for your example.
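If the subtraction has to happen at runtime, the set difference can be computed in code. The following is a minimal sketch, not a complete solution: `subtract_class` and `expand_class_body` are hypothetical helper names, and it assumes ASCII-only class bodies made of literal characters and simple ranges (no escapes, negation, or nesting).

```cpp
#include <bitset>
#include <cstddef>
#include <string>

// Expand a simple class body such as "a-z0-9_" into a set of ASCII characters.
// Assumption: only literal characters and "x-y" ranges, no escapes or negation.
static std::bitset<128> expand_class_body(const std::string& body) {
    std::bitset<128> set;
    for (std::size_t i = 0; i < body.size(); ++i) {
        unsigned char lo = static_cast<unsigned char>(body[i]);
        unsigned char hi = lo;
        // "x-y" denotes a range; a '-' that is not followed by a character is a literal
        if (i + 2 < body.size() && body[i + 1] == '-') {
            hi = static_cast<unsigned char>(body[i + 2]);
            i += 2;
        }
        for (unsigned c = lo; c <= hi && c < 128; ++c) set.set(c);
    }
    return set;
}

// Build an explicit class equivalent to the XML Schema form [base-[minus]].
std::string subtract_class(const std::string& base, const std::string& minus) {
    const std::bitset<128> keep = expand_class_body(base) & ~expand_class_body(minus);
    std::string out = "[";
    for (int c = 0; c < 128; ++c) {
        if (!keep.test(static_cast<std::size_t>(c))) continue;
        const char ch = static_cast<char>(c);
        // Escape characters that have a special meaning inside a class
        if (ch == '\\' || ch == ']' || ch == '[' || ch == '^' || ch == '-') out += '\\';
        out += ch;
    }
    return out + "]";
}
```

For the example in the question, `subtract_class("a-z", "aeiou")` yields `[bcdfghjklmnpqrstvwxyz]`. Edge cases such as an empty result or non-ASCII input are not handled here.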

Another solution is to find either a more featureful regular expression engine or a dedicated XML library that supports XML Schema and its regular expression language.

answered Mar 10 '21 at 01:21

Acorn


Unfortunately the first option doesn't work for me because I need to support user provided regex strings as long as those strings fit the XML Schema regular expression definition (which supports character class subtraction). It looks like I will need to find a way to use a non-standard regex engine in C++ that does have this support, but I don't have the slightest clue where to find one. The only thing that comes to mind is "boost", but even the boost regex library does not support this flavor of regex. – tjwrona1992 Mar 10 '21 at 01:33
@tjwrona1992 What I am suggesting is that you parse the regular expression at runtime and compute the range on the fly, replacing the original expression with one that `std::regex` understands. – Acorn Mar 10 '21 at 01:54
@tjwrona1992 On library suggestions, perhaps the Apache XML one works since, if I recall correctly, it supported all kinds of W3 documents. I assume that if they support schemas, then they need to support the regular expression language defined there. – Acorn Mar 10 '21 at 01:55
I see, computing it on the fly may be a viable solution, but will definitely take quite a bit of time and effort to implement. I'll check out the Apache XML library and see if it has what I need in it. I also took a look at Qt and Qt5 does have a `RegExp` library that looks like it supports XML Schema regular expressions, but it appears to be deprecated in Qt6 and I am really trying to avoid using software that won't be supported in the future. – tjwrona1992 Mar 10 '21 at 20:20
@tjwrona1992 I just took a look and Apache Xerces claims to support all XML Schema 1.0, 1.1 and XSD. Plus you get the rest of the features for free which you may need in the future, not to mention the difficulty of doing so properly. Hopefully that helps you! – Acorn Mar 11 '21 at 09:03
I'm looking through that now, but as far as I can tell it only seems to mention reading existing XML files and validating them against the schema. I'm looking for a way to take data in a C++ program and validate the data against an XML Schema to write XML files that are conformant to the schema. – tjwrona1992 Mar 11 '21 at 13:56
@tjwrona: I agree with Acorn. The code base of Apache Xerces must contain a feature-complete implementation of the XSD regex standard. It should not matter whether the input string comes from an XML document or from the 'info set' held in a DOM (or some other type of C++ object). – kimbert Mar 13 '21 at 12:40
@kimbert, after a lot of digging through the Xerces documentation I think it may actually have what I am looking for here: http://xerces.apache.org/xerces-c/apiDocs-3/classXMLString.html#aeecfcbf4663b63758fe7692457b4cc98, Still haven't had a chance to try it out and verify yet but I hope to have time this weekend. – tjwrona1992 Mar 17 '21 at 17:00

Bob · Answer 2 · 2021-03-11T21:18:50.233

Starting from the cppreference examples

#include <iostream>
#include <regex>
 
void show_matches(const std::string& in, const std::string& re)
{
    std::smatch m;
    std::regex_search(in, m, std::regex(re));
    if(m.empty()) {
        std::cout << "input=[" << in << "], regex=[" << re << "]: NO MATCH\n";
    } else {
        std::cout << "input=[" << in << "], regex=[" << re << "]: ";
        std::cout << "prefix=[" << m.prefix() << "] ";
        for(std::size_t n = 0; n < m.size(); ++n)
            std::cout << " m[" << n << "]=[" << m[n] << "] ";
        std::cout << "suffix=[" << m.suffix() << "]\n";
    }
}
 
int main()
{
    // greedy match: 2 to 4 consecutive lowercase letters that are not vowels
    show_matches("abcdefghi", "(?:(?![aeiou])[a-z]){2,4}");
}

You can test and check the details of the regular expression here.

A non-capturing group (?: ...) is used so that the numbering of your own groups does not change if you embed this in a bigger regular expression.

(?![aeiou]) is a negative lookahead: it succeeds, without consuming any input, if the next character does not match [aeiou]. The [a-z] then matches a lowercase letter. Combining these two conditions is equivalent to your character class subtraction.

The {2,4} is a quantifier that means from 2 to 4 repetitions; it could also be + for one or more, or * for zero or more.
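As a quick sanity check of that equivalence, the sketch below compares the lookahead form against the explicitly written-out set for every single ASCII character. The function name `lookahead_matches_explicit_set` is made up for illustration, and the check only covers this one example class.

```cpp
#include <regex>
#include <string>

// Returns true if the lookahead emulation of [a-z-[aeiou]] accepts exactly
// the same single ASCII characters as the explicitly written-out class.
bool lookahead_matches_explicit_set() {
    const std::regex via_lookahead("(?:(?![aeiou])[a-z])");
    const std::regex explicit_set("[bcdfghjklmnpqrstvwxyz]");
    for (int c = 1; c < 128; ++c) {
        const std::string s(1, static_cast<char>(c));
        // regex_match requires the whole one-character string to match
        if (std::regex_match(s, via_lookahead) != std::regex_match(s, explicit_set)) {
            return false;
        }
    }
    return true;
}
```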

Edit

Reading the comments on the other answer, I understand that you want to support XML Schema.

The next program shows how to use an ECMAScript regular expression to translate "character class subtraction" into an ECMAScript-compatible format.

#include <iostream>
#include <regex>
#include <string>
#include <vector>

std::string translated_regex(const std::string &pattern){
    // pattern to identify character class subtraction
    std::regex class_subtraction_re(
       "\\[((?:\\\\[\\[\\]]|[^[\\]])*)-\\[((?:\\\\[\\[\\]]|[^[\\]])*)\\]\\]"
    );
    // translate the regular expression to ECMA compatible
    std::string translated = std::regex_replace(pattern, 
       class_subtraction_re, "(?:(?![$2])[$1])");
    return translated;
}
void show_matches(const std::string& in, const std::string& re)
{
    std::smatch m;
    std::regex_search(in, m, std::regex(re));
    if(m.empty()) {
        std::cout << "input=[" << in << "], regex=[" << re << "]: NO MATCH\n";
    } else {
        std::cout << "input=[" << in << "], regex=[" << re << "]: ";
        std::cout << "prefix=[" << m.prefix() << "] ";
        for(std::size_t n = 0; n < m.size(); ++n)
            std::cout << " m[" << n << "]=[" << m[n] << "] ";
        std::cout << "suffix=[" << m.suffix() << "]\n";
    }
}



int main()
{
    std::vector<std::string> tests = {
        "Some text [0-9-[4]] suffix", 
        "([abcde-[ae]])",
        "[a-z-[aei]]|[A-Z-[OU]] "
    };
    std::string re = translated_regex("[a-z-[aeiou]]{2,4}");
    show_matches("abcdefghi", re);
    
    for(std::string test : tests){
       std::cout << " " << test << '\n' 
        << "   -- " << translated_regex(test) << '\n'; 
    }
    
    return 0;
}

Edit: Recursive and Named character classes

The approach above does not work with nested character class subtraction, and nested substitutions cannot be handled with regular expressions alone. This makes the solution far less straightforward.

The solution has the following levels:

One function scans the regular expression for a [.
When a [ is found, a second function handles the character class, calling itself recursively whenever -[ is found.
The pattern \p{xxxxx} is handled separately to identify named character classes. The named classes are defined in the specialCharClass map; I filled in two examples.

#include <iostream>
#include <regex>
#include <string>
#include <vector>
#include <map>

std::map<std::string, std::string> specialCharClass = {
    {"IsDigit", "0-9"},
    {"IsBasicLatin", "a-zA-Z"}
    // Feel free to add the character classes you want
};

const std::string getCharClassByName(const std::string &pattern, size_t &pos){
    std::string key;
    while(++pos < pattern.size() && pattern[pos] != '}'){
        key += pattern[pos];
    }
    ++pos;
    return specialCharClass[key];
}

std::string translate_char_class(const std::string &pattern, size_t &pos){
    
    std::string positive;
    std::string negative;
    if(pattern[pos] != '['){
        return "";
    }
    ++pos;
    
    while(pos < pattern.size()){
        if(pattern[pos] == ']'){
            ++pos;
            if(negative.size() != 0){
                return "(?:(?!" + negative + ")[" + positive + "])";
            }else{
                return "[" + positive + "]";
            }
        }else if(pattern[pos] == '\\'){
            if(pos + 3 < pattern.size() && pattern[pos+1] == 'p'){
                positive += getCharClassByName(pattern, pos += 2);
            }else{
                positive += pattern[pos++];
                positive += pattern[pos++];
            }
        }else if(pattern[pos] == '-' && pos + 1 < pattern.size() && pattern[pos+1] == '['){
            if(negative.size() == 0){
                negative = translate_char_class(pattern, ++pos);
            }else{
                // several subtracted classes: exclude a match of any of them
                negative += '|';
                negative += translate_char_class(pattern, ++pos);
            }
        }else{
            positive += pattern[pos++];
        }
    }
    return '[' + positive; // unterminated class (error): forward what was read so far
}

std::string translate_regex(const std::string &pattern, size_t pos = 0){
    std::string r;
    while(pos < pattern.size()){
        if(pattern[pos] == '\\'){
            r += pattern[pos++];
            r += pattern[pos++];
        }else if(pattern[pos] == '['){
            r += translate_char_class(pattern, pos);
        }else{
            r += pattern[pos++];
        }
    }
    return r;
}

void show_matches(const std::string& in, const std::string& re)
{
    std::smatch m;
    std::regex_search(in, m, std::regex(re));
    if(m.empty()) {
        std::cout << "input=[" << in << "], regex=[" << re << "]: NO MATCH\n";
    } else {
        std::cout << "input=[" << in << "], regex=[" << re << "]: ";
        std::cout << "prefix=[" << m.prefix() << "] ";
        for(std::size_t n = 0; n < m.size(); ++n)
            std::cout << " m[" << n << "]=[" << m[n] << "] ";
        std::cout << "suffix=[" << m.suffix() << "]\n";
    }
}



int main()
{
    std::vector<std::string> tests = {
        "[a]",
        "[a-z]d",
        "[\\p{IsBasicLatin}-[\\p{IsDigit}-[89]]]",
        "[a-z-[aeiou]]{2,4}",
        "[a-z-[aeiou-[e]]]",
        "Some text [0-9-[4]] suffix", 
        "([abcde-[ae]])",
        "[a-z-[aei]]|[A-Z-[OU]] "
    };
    
    for(std::string test : tests){
       std::cout << " " << test << '\n' 
        << "   -- " << translate_regex(test) << '\n'; 
        // Construct a regex (validates the translated syntax)
        std::regex(translate_regex(test)); 
    }
    std::string re = translate_regex("[a-z-[aeiou-[e]]]{2,10}");
    show_matches("abcdefghi", re);
    
    return 0;
}

This seems really close to what I need, but it doesn't quite work when you have nested character class subtraction, for example `[a-z-[abc-[b]]]` will translate to `[a-z-(?:(?![b])[abc])]` which is not valid. :( — tjwrona1992, Mar 11 '21 at 14:06
Sorry I didn't know it could be applied recursively. Could you write a complete list of examples? — Bob, Mar 11 '21 at 19:55
Here is a page that describes how character class subtraction is supposed to work with a bit more detail: https://www.regular-expressions.info/charclasssubtract.html, I think the only real thing left to worry about is nested character classes. XML Schema regular expressions also support a couple of unique character classes `\i` and `\c` but those are pretty well defined and I can do a similar search/replace to replace them with the corresponding character class. These character classes are described in detail here: https://www.regular-expressions.info/shorthand.html#xml — tjwrona1992, Mar 11 '21 at 20:05
Wow you are awesome! Thank you for putting in all of that time and effort. It's going to take me some time to analyze this and break it all down, but if all goes well this may do the trick! If this all works the only thing left would be to find a way to restrict any features that are valid in an `ECMA` regex that are not valid in an `XMLSchema` regex, but I'm honestly not sure if that is even worth the effort because this should work for nearly all use cases. — tjwrona1992, Mar 11 '21 at 21:41
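
The `\i` and `\c` shorthands mentioned in these comments could be expanded in a similar single pass over the pattern. This is only a sketch under stated assumptions: `expand_xml_shorthands` is a hypothetical helper, the replacement sets are ASCII-only approximations (the XML definitions also include many non-ASCII characters), and the negated forms `\I` and `\C` are not handled.

```cpp
#include <cstddef>
#include <string>

// Replace the XML Schema shorthands \i and \c with ASCII-only approximations.
// Outside a class they become a full [...] class; inside one, only the body.
std::string expand_xml_shorthands(const std::string& pattern) {
    const std::string initial_body = "A-Za-z_:";        // approximates \i
    const std::string name_body = "A-Za-z0-9._:\\-";    // approximates \c
    std::string out;
    int depth = 0;  // nesting depth of [...] character classes
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        const char ch = pattern[i];
        if (ch == '\\' && i + 1 < pattern.size()) {
            const char next = pattern[++i];
            if (next == 'i' || next == 'c') {
                const std::string& body = (next == 'i') ? initial_body : name_body;
                out += depth > 0 ? body : "[" + body + "]";
            } else {
                // Any other escape (including an escaped backslash) is copied as-is
                out += ch;
                out += next;
            }
            continue;
        }
        if (ch == '[') ++depth;
        else if (ch == ']' && depth > 0) --depth;
        out += ch;
    }
    return out;
}
```

For example, `\i\c*` becomes `[A-Za-z_:][A-Za-z0-9._:\-]*`. The output still uses XML Schema syntax for subtraction, so this pass would run before the subtraction is translated.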

score 1 · Answer 3 · answered Mar 14 '21 at 12:21

Try using a function from a library with XPath support, such as xmlregexp in libxml (a C library). It can handle XML regexes and apply them to the XML directly.

http://www.xmlsoft.org/html/libxml-xmlregexp.html#xmlRegexp

----> http://web.mit.edu/outland/share/doc/libxml2-2.4.30/html/libxml-xmlregexp.html <----

An alternative could have been PugiXML (a C++ library; see "What XML parser should I use in C++?"). However, I think it does not implement the XML regex functionality.

answered Mar 14 '21 at 12:21

ralf htp


I'll checkout `libxml` for the `xmlregexp`, that sounds promising! Thanks! – tjwrona1992 Mar 15 '21 at 00:54
It looks like buried *deep* within the `Xerces-C++` documentation there is a pattern match function that may also do what I am looking for: http://xerces.apache.org/xerces-c/apiDocs-3/classXMLString.html#aeecfcbf4663b63758fe7692457b4cc98. I'll give both `Xerces-C++` and `libxml2` a try and see how it goes. – tjwrona1992 Mar 16 '21 at 21:07
I'm accepting this answer because this is the approach I am going to take and this will probably provide the most value to others viewing this question, but I had to give the bounty to the guy who spent all of that time and effort to literally translate XML regular expressions into a grammar that `std::regex` supports. That takes some intense dedication lol – tjwrona1992 Mar 18 '21 at 15:17

score 0 · Accepted Answer · answered Mar 21 '21 at 00:04

Okay, after going through the other answers, I tried out a few different things and ended up using the xmlRegexp functionality from libxml2.

The xmlRegexp-related functions are very poorly documented, so I figured I would post an example here because others may find it useful:

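#include <iostream>
#include <libxml/xmlmemory.h>
#include <libxml/xmlregexp.h>
#include <libxml/xmlstring.h>

int main()
{
    LIBXML_TEST_VERSION;

    xmlChar* str = xmlCharStrdup("bcdfg");
    xmlChar* pattern = xmlCharStrdup("[a-z-[aeiou]]+");
    xmlRegexp* regex = xmlRegexpCompile(pattern);

    // xmlRegexpCompile returns NULL if the pattern cannot be compiled
    if (regex == nullptr)
    {
        std::cout << "Invalid pattern" << std::endl;
    }
    // xmlRegexpExec returns 1 on a match, 0 on no match, negative on error
    else if (xmlRegexpExec(regex, str) == 1)
    {
        std::cout << "Match!" << std::endl;
    }

    // Release libxml2 objects with libxml2's own deallocators, not free()
    if (regex != nullptr)
    {
        xmlRegFreeRegexp(regex);
    }
    xmlFree(pattern);
    xmlFree(str);
}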

Output:

Match!

I also attempted to use XMLString::patternMatch from the Xerces-C++ library, but it didn't seem to use an XML Schema compliant regex engine underneath. (Honestly, I have no clue what regex engine it uses; the documentation for it was pretty abysmal and I couldn't find any examples online, so I gave up on it.)
