I'm trying to write a regular expression in C++ to match a base64 encoded string. I'm quite familiar with writing complex regular expressions in Perl, so I started with that:

use strict;
use warnings;

my $base64_regex = qr{
(?(DEFINE)
    (?<B64>[A-Za-z0-9+/])
    (?<B16>[AEIMQUYcgkosw048])
    (?<B04>[AQgw])
)
^(
    ((?&B64){4})*
    (
        (?&B64){4}|
        (?&B64){2}(?&B16)=|
        (?&B64)(?&B04)={2}
    )
)?$}x;

# "Hello World!" base64 encoded 
my $base64 = "SGVsbG8gV29ybGQh";

if ($base64 =~ $base64_regex)
{
    print "Match!\n";
}
else
{
    print "No match!\n";
}

Output:

Match!

I then tried to implement a similar regular expression in C++:

#include <iostream>
#include <regex>
#include <string>

int main()
{
    std::regex base64_regex(
        "(?(DEFINE)"
            "(?<B64>[A-Za-z0-9+/])"
            "(?<B16>[AEIMQUYcgkosw048])"
            "(?<B04>[AQgw])"
        ")"
        "^("
            "((?&B64){4})*"
            "("
                "(?&B64){4}|"
                "(?&B64){2}(?&B16)=|"
                "(?&B64)(?&B04)={2}"
            ")"
        ")?$");

    // "Hello World!" base64 encoded 
    std::string base64 = "SGVsbG8gV29ybGQh";

    if (std::regex_match(base64, base64_regex))
    {
        std::cout << "Match!" << std::endl;
    }
    else
    {
        std::cout << "No Match!" << std::endl;
    }
}

But when I run the code, I get an exception telling me it is not a valid regular expression.

Catching the exception and printing the "what" string doesn't help much either. All it gives me is the following:

regex_error(error_syntax)
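
For reference, a minimal sketch of that kind of check: it tries to construct a std::regex and, on failure, prints both what() and the numeric code(). The helper name compiles_as_std_regex is hypothetical, and the exact message and error code differ between standard library implementations, so the output may not read error_syntax everywhere.

```cpp
#include <iostream>
#include <regex>
#include <string>

// Hypothetical helper (not part of any library): returns true if the
// pattern is accepted by std::regex with its default ECMAScript grammar.
bool compiles_as_std_regex(const std::string& pattern)
{
    try
    {
        std::regex re(pattern);  // throws std::regex_error on invalid syntax
        return true;
    }
    catch (const std::regex_error& e)
    {
        // code() is a std::regex_constants::error_type; cast it for printing.
        std::cerr << "what(): " << e.what()
                  << ", code(): " << static_cast<int>(e.code()) << '\n';
        return false;
    }
}
```

Passing a pattern that starts with "(?(DEFINE)" to this helper should return false, while a plain pattern such as "^[A-Za-z0-9+/]{4}$" should return true.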

Obviously I could get rid of the "DEFINE" block with my pre-defined subpatterns, but that would make the whole expression very difficult to read... and, well... I like to be able to maintain my own code when I come back to it a few years later lol so that isn't really a good option.

How can I get a similar regular expression to work in C++?

Note: This must all be done within a single "std::regex" object, because I am writing a library where users pass in a string that defines their own regular expression, and I want them to be able to "DEFINE" similar subexpressions within that regex if they need to.

ikegami
tjwrona1992
  • Take a look at [this post](https://stackoverflow.com/questions/16886992/c11-regex-capture-groups-by-name) – SeaBean Mar 04 '21 at 03:04
  • You can also try whether (?P etc. works – SeaBean Mar 04 '21 at 03:10
  • At the very least, you can build the pattern using interpolation/concatenation (since none of them are recursive) – ikegami Mar 04 '21 at 03:24
  • Recursive regex isn't supported in C++, but Boost can help. See [this post](https://stackoverflow.com/q/29397066/4653379) and [this post](https://stackoverflow.com/q/56622828/4653379) for example, and here's a [boost page](https://www.boost.org/doc/libs/1_61_0/libs/xpressive/doc/html/boost_xpressive/user_s_guide/grammars_and_nested_matches.html) – zdim Mar 04 '21 at 03:29
  • @zdim, I'm not looking for recursive though, I just need to be able to define named subexpressions. Honestly I'd think this would be simpler than recursion, but I can't figure out how to get it to work. – tjwrona1992 Mar 04 '21 at 04:43
  • @tjwrona1992 Right, but I'm thinking that even just the syntax itself drops it from C++. (I looked and can't find "named pattern" or "DEFINE" ... or how would one call it in c++ ?) I'd expect Boost to be able to do what you need however, if you can use it...? – zdim Mar 04 '21 at 05:17
  • @tjwrona1992 Yeah, it all goes [in boost::regex](https://www.boost.org/doc/libs/1_75_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html#boost_regex.syntax.perl_syntax.conditional_expressions), including Conditional (w/ DEFINE), recursive, etc – zdim Mar 04 '21 at 05:23
  • On [this boost page](https://www.boost.org/doc/libs/1_57_0/libs/regex/doc/html/boost_regex/ref/syntax_option_type/syntax_option_type_perl.html) it is stated that _ECMAScript_ (what C++ uses I believe) is "functionally identical" to Perl regex syntax. Perhaps syntax is identical but just doesn't support as much? – zdim Mar 04 '21 at 05:28
  • Found a [page on MS docs](https://learn.microsoft.com/en-us/cpp/standard-library/regular-expressions-cpp?view=msvc-160) which seems to spell out a complete account of what is available, confirming that named patterns (or subexpressions or whatever they'd be called) _are not_. Not sure how definitive this is but I did find other accounts of what ECMAScript supports and named patterns aren't there. Sorry :( Either use Boost or use normal captures. – zdim Mar 04 '21 at 06:09
  • @zdim, Looking at the boost documentation it does look like boost supports a "Perl" flavor of regex that may support this. I was hoping to avoid using boost because it adds considerable bloat to the project when I only need a small piece, but I will give it a try. – tjwrona1992 Mar 04 '21 at 19:34
  • @tjwrona1992 Yes, it even lists it specifically at one of the links in my comments above. The page is titled [The Perl regular expression syntax](https://www.boost.org/doc/libs/1_75_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html#boost_regex.syntax.perl_syntax.conditional_expressions), and that means the syntax used in Boost. (Search for `DEFINE`...) I've seen elsewhere -- it's in another link somewhere in this thread -- that it allows recursive regex (I mean to say, if it has _that_ then it has everything that there is :). I understand the point about the size... – zdim Mar 04 '21 at 20:57
  • @zdim, The boost "Perl" regex implementation worked great! Thanks for the help. Feel free to post it as an official answer if you want and I will accept it. – tjwrona1992 Mar 05 '21 at 03:51
  • @tjwrona1992 Great :) Thanks for letting me know, it is good to know that it worked nicely. Thank you for offering that I write it up, but I'd suggest rather that you post what you did (and "accept" it) --- it'd be good to have an answer here. – zdim Mar 05 '21 at 04:06

2 Answers

How about string concatenation?

#include <regex>

#define B64 "[A-Za-z0-9+/]"
#define B16 "[AEIMQUYcgkosw048]"
#define B04 "[AQgw]"

std::regex base64_regex(
        "^("
            "(" B64 "{4})*"
            "("
                B64 "{4}|"
                B64 "{2}" B16 "=|"
                B64 B04 "={2}"
            ")"
        ")?$");
Jarod42
  • That does look quite clean, but unfortunately it won't work for my use case. Ultimately I am writing a library that will allow users to provide their own regular expression strings from within an XML file and I want users to be able to "DEFINE" subexpressions themselves. Because of this, it all has to be done within a single string. – tjwrona1992 Mar 04 '21 at 19:32
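
For readers who can build the pattern in code, here is a minimal sketch of the same concatenation idea using std::string constants instead of macros. The function name is_base64 and the local constant names are hypothetical; the character classes are copied from the question. It does not cover the single-string use case described in the comment above.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `text` is a well-formed base64 string
// (the empty string is accepted, as in the original pattern).
bool is_base64(const std::string& text)
{
    static const std::string B64 = "[A-Za-z0-9+/]";      // any base64 digit
    static const std::string B16 = "[AEIMQUYcgkosw048]"; // last digit before a single '='
    static const std::string B04 = "[AQgw]";             // last digit before "=="

    // Built once; each name is spliced in where Perl would use (?&NAME).
    static const std::regex base64_regex(
        "^("
            "(" + B64 + "{4})*"
            "("
                + B64 + "{4}|"
                + B64 + "{2}" + B16 + "=|"
                + B64 + B04 + "={2}"
            ")"
        ")?$");

    return std::regex_match(text, base64_regex);
}
```

With the sample input from the question, is_base64("SGVsbG8gV29ybGQh") should return true, while a string whose length is not a multiple of four, such as "SGVsbG8", should return false.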

I took a suggestion from the comments and checked out "boost" regex since it supports "Perl" regular expressions. I gave it a try and it worked great!

#include <boost/regex.hpp>

boost::regex base64_regex(
    "(?(DEFINE)"
        "(?<B64>[A-Za-z0-9+/])"
        "(?<B16>[AEIMQUYcgkosw048])"
        "(?<B04>[AQgw])"
    ")"
    "("
        "((?&B64){4})*"
        "("
            "(?&B64){4}|"
            "(?&B64){2}(?&B16)=|"
            "(?&B64)(?&B04)={2}"
        ")"
    ")?", boost::regex::perl);
tjwrona1992