Regular expression: “[^”]*“

String: “lips“

Result: match

String: “lips’“

Result: no match

I expect both strings to match.

C++ code:

#include <iostream>
#include <string>
#include <boost/regex.hpp>

using namespace std;
using namespace boost;

int main()
{
    const string s1 = "“lips“";
    const string s2 = "“lips’“";
    if (regex_search(s1, regex("“[^”]*“"))) cout << "s1 matched" << endl;
    if (regex_search(s2, regex("“[^”]*“"))) cout << "s2 matched" << endl;
    return 0;
}

Output: s1 matched

Is the ’ symbol special? Why does the second string not match?

Alex
  • It is not special; s1 and s2 both match: https://regex101.com/r/iPMj5C/1 –  May 22 '20 at 18:20
  • Boost.Regex does not use PCRE, and Boost is not controlled the way the Perl-style engines are. It does not look like wide characters are the default, so the strings must be encoded as UTF-8. Which compiler are you using with Boost? –  May 22 '20 at 18:30
  • You might want to clarify that you're intentionally using non-ASCII quotation characters (201c LEFT DOUBLE QUOTATION MARK, 201d RIGHT DOUBLE QUOTATION MARK, 2019 RIGHT SINGLE QUOTATION MARK). I initially thought you might be using ASCII QUOTATION MARK and APOSTROPHE characters and something was incorrectly translating them to "smart quotes". Or if you really are using the ASCII characters, edit your question. – Keith Thompson May 22 '20 at 18:34
  • It is always better to use hex escapes when writing literal characters in a regex that fall in the control range or the Unicode range (> U+00100); that way there is at least no engine ambiguity. Literals in the target string are a different story. –  May 22 '20 at 18:39
  • I use gcc for C++. – Alex May 22 '20 at 19:18
  • Yes, I use non-ASCII symbols. – Alex May 22 '20 at 19:20
  • In general, C++ source code should be ASCII. If you use the _universal_ character syntax, you can still put Unicode characters in string literals: `const string s1 = "\u201clips\u201c"; const string s2 = "\u201clips\u2019\u201c";` For Boost.Regex, it is always better to use the `\x{}` notation for Unicode characters (>= U+0100): `boost::regex("\\x{201c}[^\\x{201c}]*\\x{201c}")` If you tell gcc that this is a Unicode project, it reads the ASCII source and converts the `\u` notation to the proper Unicode characters. Note that the `"\\x{}"` in the C++ literal becomes `\x{}` by the time it is passed to the regex constructor. –  May 23 '20 at 17:59
  • If you really have the need to incorporate unicode char literals in your source code, it can be tricky to maintain. See this for some ideas though https://stackoverflow.com/questions/331690/using-unicode-in-c-source-code –  May 23 '20 at 18:01
  • What if I need to match strings from memory rather than literals? Is it possible to recompile Boost.Regex to work with UTF-8? – Alex May 23 '20 at 18:07
  • See https://www.boost.org/doc/libs/1_65_0/libs/regex/doc/html/boost_regex/ref/non_std_strings/icu/unicode_algo.html and specifically https://www.boost.org/doc/libs/1_65_0/libs/regex/doc/html/boost_regex/ref/non_std_strings/icu/intro.html –  May 23 '20 at 18:19
  • You can use wide UTF-16 or UTF-8 characters _natively_ without ICU. Boost.Regex uses ICU for converting the target string between UTF-8/16/32 as needed, and for the extra `\p{}` _properties_ it makes available. You can tell gcc to use wide characters natively with Boost.Regex and then use the existing library. Or you can include the regex source in your project, in which case you don't need the Boost library. Note that this is possibly not portable. –  May 23 '20 at 18:27
  • Or do what a lot of people have done and switch to PCRE, which is much more flexible. But this requires doing a lot of the Unicode conversions yourself. –  May 23 '20 at 18:28
  • What is the name of the PCRE library with UTF-8 support for C++? – Alex May 23 '20 at 18:31
  • I used to know a long time ago, but forgot. Google it. I always thought you have to include the source code for PCRE in your project, but I'm sure there is a lib by now. –  May 23 '20 at 18:31

1 Answer

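The Boost regex library does not treat strings as UTF-8 by default; boost::regex matches byte by byte. In UTF-8 the quotation marks and the apostrophe share bytes (“ is E2 80 9C, ” is E2 80 9D, ’ is E2 80 99), so the class [^”] also rejects the leading bytes of ’. That is why the regex does not match. Code for UTF-8: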

#include <iostream>
#include <string>
#include <boost/regex.hpp>
#include <boost/regex/icu.hpp>

using namespace std;
using namespace boost;

int main()
{
    const string s1 = "“lips“";
    const string s2 = "“lips’“";
    if (u32regex_search(s1, make_u32regex("“[^”]*“"))) cout << "s1 matched" << endl;
    if (u32regex_search(s2, make_u32regex("“[^”]*“"))) cout << "s2 matched" << endl;
    return 0;
}

Compilation: g++ -std=c++11 ./test.cc -licuuc -lboost_regex

Output:

s1 matched
s2 matched
Alex