
I'm trying to write regular expressions in C++ to validate XML files and extract the strings stored between tags.

This is one of the expressions I'm aiming for:

"<[^/]*?>"

This doesn't work, however. Neither does something simpler like this:

 "<[a-z]*>"

However, this produces a match:

 "<.*>"

It seems that bracket expressions (character classes in square brackets) can't be matched.

Below is the relevant part of the code I'm using:

string testString = "<test>";

regex xmlRegOpenTag("<[^/]*?>", regex_constants::extended); 
smatch smOpen;
cout << regex_match(testString, smOpen, xmlRegOpenTag) << endl;

string openCap = smOpen[0];
cout << "openCap: " << openCap << endl;

I've tried using other flags like regex_constants::basic, etc. Nothing seems to be working. I'm compiling using gcc version 4.7.3.

To those mentioning that I shouldn't be parsing XML using regex: I only need to parse XML files that I've created myself, so it isn't a problem.

I'm using the C++11 standard. In my header file, I'm including regex as such:

#include <regex>
using namespace std;

When using the first expression ("<[^/]*?>"), I get:

terminate called after throwing an instance of 'std::regex_error'
  what():  regex_error
Abort

When using the second expression ("<[a-z]*>"), I get:

0
openCap: 

When using the third expression ("<.*>"), I get:

1
openCap: <test>

This is the information I can provide about the compiler I'm using:

Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.7/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 4.7.3-1ubuntu1' --with-bugurl=file:///usr/share/doc/gcc-4.7/README.Bugs --enable-languages=c,c++,go,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.7 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.7 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --enable-plugin --with-system-zlib --enable-objc-gc --with-cloog --enable-cloog-backend=ppl --disable-cloog-version-check --disable-ppl-version-check --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.7.3 (Ubuntu/Linaro 4.7.3-1ubuntu1) 
Tagc
  • Be warned: [You can't parse XML with regex](http://stackoverflow.com/a/1732454/395760). – Jan 18 '14 at 16:52
  • @delnan, I got the functionality that I wanted using the QT libraries, but I need to rewrite the code to work in standard C++. The XML files that I need to parse are ones that I create for a very specific purpose. – Tagc Jan 18 '14 at 16:57
  • gcc version 4.7.3, ... with your code, I've an exception for regex `<[^/]*>`. With gcc 4.8.2 too. Exception regex_error. – ShinTakezou Jan 18 '14 at 17:13
  • @ShinTakezou That's the error I get. – Tagc Jan 18 '14 at 17:17
  • Possible duplicate of [Is gcc 4.8 or earlier buggy about regular expressions?](https://stackoverflow.com/questions/12530406/is-gcc-4-8-or-earlier-buggy-about-regular-expressions) – Trevor Boyd Smith Dec 05 '18 at 15:26
  • FYI your regex feature is only supported for g++ version 4.9.2 or greater. – Trevor Boyd Smith Dec 05 '18 at 15:27

3 Answers


First of all, XML is not a regular language, and you shouldn't try to parse it with regular expressions; sooner or later it will give you real headaches. Use one of the available XML parsers instead. For example, given something such as "<foo><bar /></foo>", a pattern such as <.*> will match the whole string and not just the first tag. You can try 'lazy' matching with <.*?>, which matches as few characters as possible, but that can still break if, for example, a > appears inside a quoted attribute value.

Now, let's just pretend that parsing XML with RegExes wouldn't be a problem. All the RegExes you gave should match <test>, and they do so in the implementations I tried. That suggests a bug either in your code or in the library you use, but I don't see one in your code, and the standard implementation of regex shouldn't be buggy either...

EDIT: I just tried in C++ and the RegExes work as well. In a minimalist implementation

regex reg("<[^/]*>");
if (regex_match("<test>", reg))
    cout << "Matched..." << endl;
else
    cout << "Didn't match..." << endl;

yields the output "Matched...", and <[a-z]*> works as well. I used clang-500.2.79 in this experiment. This basically confirms that the regex implementation supplied with your compiler is faulty.

Cu3PO42
  • You can assume that parsing XML with regex won't be a problem for me, since I'm only using it to parse XML files that I create. I've tested the same regex using the QT libraries and Python, and it works there. I'm compiling using the C++11 standard, and as for regex, I'm including it in the header file as simply `#include <regex>` and `using namespace std`. – Tagc Jan 18 '14 at 17:09
  • @Tagc specify the compiler. In gcc, for example, `<regex>` was non-functional stubs until about 4 months ago. – Cubbi Jan 18 '14 at 17:30
  • @Cubbi I'm using gcc version 4.7.3. I've updated my post with additional information. – Tagc Jan 18 '14 at 18:04

The regexes you tried:

[^/]* matches any character except '/', 0 or more times (greedy: as many as possible)

[a-z]* matches any character from 'a' to 'z', 0 or more times (greedy: as many as possible)

.* matches any character, 0 or more times (greedy: as many as possible)

Rakesh KR
  • I think he already knows that. He's trying to figure out why `test` matches `.*`, but not `[^/]*` or `[a-z]*`. – Gabe Jan 18 '14 at 17:00
  • Yes. The first expression is what I want to use (or rather, it should have been "[^/]*?" (non-greedy matching), and I will update the original post to include that). I know the regex is correct, since it works in another version of the code I've written in QT, and in Python as well. – Tagc Jan 18 '14 at 17:00

I had the same problem. It appears that character-set matching (with square brackets) is broken in gcc 4.x with the default ECMAScript syntax. Using the std::regex::extended parser seems to work, i.e.:

std::regex re(".*", std::regex::ECMAScript);     // ok
std::regex re("[a-z]", std::regex::ECMAScript);  // regex_error
std::regex re("[a-z]", std::regex::extended);    // ok