I'm trying to use a captured group directly in the regex. However, when I try to do this the program hangs indefinitely.

For example:

string input = "<Tag>blahblah</Tag>";
regex r1("<([a-zA-Z]+)>[a-z]+</\1>");
string result = regex_replace(input, r1, "");

If I add another backslash to the backreference, "<([a-zA-Z]+)>[a-z]+</\\1>", the program compiles but throws a "regex_error(regex_constants::error_backref)" exception.

Notes:
Compiler: Apple LLVM 5.1
I am using this as part of the process to clean junk from blocks of text. The document is not necessarily HTML/XML and desired text is not always within tags. So if possible, I would like to be able to do this with regular expressions, not a parser.

  • your regex should be `<([a-zA-Z]+)>[a-z]+\1>` – Avinash Raj Sep 05 '14 at 17:29
  • Sorry, adding the plus was an oversight on my part when writing the question. Thanks for the catch; I've edited the code. However, the problem is centered more around the use of the capture than the rest of the regex – user2238231 Sep 05 '14 at 17:33
  • It looks like you're trying to parse (X)HTML using regex. You really shouldn't use regex for that. – RevanProdigalKnight Sep 05 '14 at 17:36
  • Which compiler version are you actually using? Note `std::regex` [is broken up to GCC 4.8](http://stackoverflow.com/questions/15059162/c11-regex-matching). – πάντα ῥεῖ Sep 05 '14 at 17:37
  • I'm using Xcode 5.1.1, so I think the Clang 5.1.0 compiler – user2238231 Sep 05 '14 at 18:05
  • Use an XML parser, not regular expressions. XML is a language and may not be suitable for a universal regular expression. – Thomas Matthews Sep 05 '14 at 18:18
  • I am using this as part of the process to clean junk from blocks of text. The document is not necessarily HTML or XML and the desired text is not always within tags. So using an HTML/XML parser is not a viable solution – user2238231 Sep 05 '14 at 19:00

1 Answer

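The backslash is an escape character in C++ string literals, so "\1" is compiled as an octal escape (the single character with value 1) and the regex engine never sees a backreference.

Either escape the backslash, "<([a-zA-Z]+)>[a-z]+</\\1>", or use a raw string literal, R"(<([a-zA-Z]+)>[a-z]+</\1>)".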
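
To make the difference concrete, here is a small sketch that compares the three spellings of the closing-tag fragment. The function names are made up for illustration; only the literals themselves come from the question and the answer.

```cpp
#include <string>

// Ordinary literal with a single backslash: "\1" is an octal escape,
// so the string holds '<', '/', the character with value 1, and '>'.
std::string unescaped_fragment() { return "</\1>"; }

// Ordinary literal with the backslash escaped: the string holds
// '<', '/', '\\', '1', '>' and the regex engine sees a backreference.
std::string escaped_fragment() { return "</\\1>"; }

// Raw string literal (C++11): no escape processing inside R"( ... )",
// so this holds the same five characters as escaped_fragment().
std::string raw_fragment() { return R"(</\1>)"; }
```

Comparing `unescaped_fragment().size()` with `escaped_fragment().size()` shows that the single-backslash version is one character shorter and contains no backslash at all, which is why the pattern built from it cannot refer to the captured group.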

With that, your program works as you would expect:

#include <regex>
#include <iostream>

int main()
{
    std::string input = "Hello<Tag>blahblah</Tag> World";
    std::regex r1("<([a-zA-Z]+)>[a-z]+</\\1>");
    std::string result = regex_replace(input, r1, "");

    std::cout << "The result is '" << result << "'\n";
}

demo: http://coliru.stacked-crooked.com/a/ae20b09d46f975e9
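
If constructing the regex still throws on your setup, it may help to catch the exception and look at its error code rather than letting the program terminate. The following is only a sketch: `check_pattern` is a hypothetical helper, not something from the standard library, and it assumes a working C++11 `<regex>`.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: tries to compile `pattern` with the default
// ECMAScript grammar and reports what happened as a short string.
std::string check_pattern(const std::string& pattern) {
    try {
        std::regex compiled(pattern);
        (void)compiled;  // only the construction matters here
        return "ok";
    } catch (const std::regex_error& e) {
        // error_backref is the code mentioned in the question.
        if (e.code() == std::regex_constants::error_backref) {
            return "error_backref";
        }
        return std::string("other regex_error: ") + e.what();
    }
}
```

Calling `check_pattern("<([a-zA-Z]+)>[a-z]+</\\1>")` from `main` and printing the result tells you whether the library accepts the backreference at all, independently of the `regex_replace` call.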

The error_backref exception you're getting with \\1 suggests that your compiler is configured to use GNU libstdc++, where regex was not properly implemented (it is broken up to GCC 4.8). Look up how to set it up to use LLVM libc++, or use Boost.Regex.
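
One way to check which standard library a build actually uses is to test the libraries' own version macros. This is a sketch with a hypothetical function name; it relies on `_LIBCPP_VERSION` (defined by libc++) and `__GLIBCXX__` (defined by libstdc++), which become visible once any standard header has been included.

```cpp
#include <string>  // any standard header makes the library macros visible

// Hypothetical helper: names the C++ standard library this translation
// unit was compiled against, based on its identifying macro.
std::string standard_library_name() {
#if defined(_LIBCPP_VERSION)
    return "libc++";
#elif defined(__GLIBCXX__)
    return "libstdc++";
#else
    return "unknown";
#endif
}
```

Printing `standard_library_name()` from `main` shows whether the project's build settings select libc++ or libstdc++.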

Cubbi
  • OK, I see. The error thrown with \\1 is what made me revert to "\1". Since \\1 is correct, my issue is actually the error being thrown. I believe that my compiler is already using libc++, because the command-line output is "libc++abi.dylib: terminate called throwing an exception", which then points to the exception. In addition, the regex without the capture works fine using the same compiler – user2238231 Sep 05 '14 at 20:23
  • @user2238231 libc++abi is not the same thing as libc++. Are you using the compiler flag `-stdlib=libc++` ? – Cubbi Sep 05 '14 at 20:38