Why can't regex find the "(" in a Japanese string in C++?

Question

I have a huge file of Japanese example sentences. Each sentence takes up one line, and the next line lists the words used in that sentence, annotated with {}, () and []. Basically, I want to read a line from the file, find only the words in the (), store them in a separate file, and then remove them from the string.

I'm trying to do this with regexp. Here is the text I'm working with:

は 二十歳(はたち){２０歳} になる[01]{になりました}

And here's the code I'm using to find the stuff between ():

std::smatch m;
std::regex e ("\(([^)]+)\)");   // matches things between ( and )

if (std::regex_search (components,m,e)) {
   printToTest(m[0].str(), "what we got"); //Prints to a test file "what we got: " << m[0].str()
   components = m.prefix().str().append(m.suffix().str());
   //components is a string
   printToTest(components, "[COMP_AFTER_REMOVAL]");
   //Prints to test file "[COMP_AFTER_REMOVAL]: " << components 
}

Here's what should get printed:

what we got:はたち
[COMP_AFTER_REMOVAL]:は 二十歳(){２０歳} になる[01]{になりました}

Here's what gets printed:

what we got:は 二十歳(はたち
[COMP_AFTER_REMOVAL]:){２０歳} になる[01]{になりました}

It seems like somehow the は is being confused for a (, which makes the regexp go from は to ). I believe it's a problem with the way the line is being read in from the file. Maybe it's not being read in as utf8 somehow. Here's what I do:

xml_document finalDoc;
string sentence;
string components;
ifstream infile;

infile.open("examples.utf");
unsigned int line = 0;
string linePos;
bool eof = infile.eof();
while (!eof && line < 1){       
    getline(infile, sentence);
    getline(infile, components);
    MakeSentences(sentence, components, finalDoc);
    line++;
}

Is something wrong? Any tips? Need more code? Please help. Thanks.

Probably this can be an answer http://stackoverflow.com/questions/11254232/do-c11-regular-expressions-work-with-utf-8-strings — SGrebenkin, Jan 22 '15 at 16:33
Regexes on UTF-8 don't make a lot of sense unless they are some third party library that handles UTF-8. — Jonathan Mee, Jan 22 '15 at 16:36
I've tried searching directly through the line to find the parens, but it gave me similar results. It really seems like は and ( are somehow interpreted the same way somewhere along the line. — Shaquil Hansford, Jan 23 '15 at 18:03

score 2 · Answer 1 · answered Jan 22 '15 at 19:04

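You forgot to escape your backslashes. In the string literal "\(([^)]+)\)", \( and \) are not valid C++ escape sequences; compilers typically warn and drop the backslash, so std::regex receives (([^)]+)), which is not the regex you wanted. That pattern is just two nested capture groups around [^)]+, so it matches from the first character of the line up to the first ), which is exactly the output you are seeing.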

You need to type "\\(([^)]+)\\)"

answered Jan 22 '15 at 19:04

Karol S


Or use a raw string literal: `R"(\(([^)]+)\))"` (though that gets a little confusing with all the parens) – Cornstalks Jan 22 '15 at 19:07
Thanks to both of you, but I tried both solutions and got the same results. Any idea why that might be? – Shaquil Hansford Jan 23 '15 at 18:02
