I have the following string:

std::string s("server ('m1.labs.teradata.com') username ('use\\')r_*5') password('u\" er 5') dbname ('default')");

I have used the following code:

#include <iostream>
#include <regex>
#include <string>

int main() {
    // Matches a single-quoted token that may contain backslash-escaped chars
    std::regex re(R"('[^'\\]*(?:\\[\s\S][^'\\]*)*')");
    std::string s("server ('m1.labs.teradata.com') username ('use\\')r_*5') password('u\" er 5') dbname ('default')");
    unsigned count = 0;
    for (std::sregex_iterator i = std::sregex_iterator(s.begin(), s.end(), re);
         i != std::sregex_iterator();
         ++i)
    {
        std::smatch m = *i;
        std::cout << "the token is" << "   " << m.str() << std::endl;
        count++;
    }
    std::cout << "There were " << count << " tokens found." << std::endl;
    return 0;
}

The output for the above string is:

the token is   'm1.labs.teradata.com'
the token is   'use\')r_*5'
the token is   'u" er 5'
the token is   'default'
There were 4 tokens found.

Now, if the string s in the code above is changed to:

std::string s("server ('m1.labs.ter\'adata.com') username ('use\\')r_*5') password('u\" er 5') dbname ('default')");

The output becomes:

the token is   'm1.labs.ter'
the token is   ') username ('
the token is   ')r_*5'
the token is   'u" er 5'
the token is   'default'
There were 5 tokens found.

The outputs for the two strings differ. The expected output is everything between the parentheses and single quotes, i.e.:

the token is   'm1.labs.teradata.com'
the token is   'use\')r_*5'
the token is   'u" er 5'
the token is   'default'
There were 4 tokens found

The regex in my code extracts the tokens properly, but it does not handle an escaped single quote. It copes with characters such as " and ), but not with a single quote. Can the regex be modified to produce the output I need? Thanks in advance.

user6511542
  • See [Rules for C++ string literals escape character](https://stackoverflow.com/questions/10220401/rules-for-c-string-literals-escape-character). To define a literal backslash, you must double it inside a non-raw string literal. There are literal strings, and there are string literals that define the literal strings in code. – Wiktor Stribiżew Jul 20 '17 at 07:11
  • The second string doesn't look like it is escaped properly. Should `('m1.labs.ter\'adata.com')` be `('m1.labs.ter\\'adata.com')`? – Galik Jul 20 '17 at 07:28
  • @WiktorStribiżew I understood the explanation. Is there any way we can change the regex to escape a single quote in the string? Suppose the string is ('user/'5'): the regex should give me 'user'5' (the output should be what is between the single quotes). – user6511542 Jul 20 '17 at 17:36
  • Do you mean you want to get `'a'b'` if you have `"'a'b' text"`? – Wiktor Stribiżew Jul 20 '17 at 17:40
  • @WiktorStribiżew Like I want to extract the code between (' **** ') The '****" should be extracted here. Now suppose I have this string as an input: username ('user\'09') The extracted string with the regex shall be: 'user'09' . So basically the escaping of the single quote should be done. Please let me know if I am not clear. Thanks in advance – user6511542 Jul 20 '17 at 18:01
  • Yes, it is clear, but there are 2 things to mention: 1) `" \' "` = `" ' "` (I hope it is clear), and thus my regex won't work here. 2) To get `'a'b'` if you matched `'a\'b'` you need to remove all backslashes - it will be a post-processing step. – Wiktor Stribiżew Jul 20 '17 at 18:04
  • If you escape a single quote char inside a regular string literal with a single backslash it will be removed when compiling the code since `" \' "` = `" ' "`. If you need to put a literal ``\`` before a single quote, use `"\\'"`. – Wiktor Stribiżew Jul 21 '17 at 07:13
  • @WiktorStribiżew Yes, the \\' part is working with your regex. Changing \' to ' is not. And the post-processing step you suggested won't work, because my string might contain a \ which should not be removed. Can we escape ' by doubling it inside (' '), that is, if I enter (user''5) ===> 'user'5' ?? – user6511542 Jul 21 '17 at 07:54
  • @user6511542 The `" \' "` is a human error. If you want to match `'user'5'` in `"here is 'user'5'"` you might try [`'([^'\\]*(?:(?:\\[\s\S]|\b'\b)[^'\\]*)*)'`](https://regex101.com/r/2vbq2s/1). But it won't extract `'user'*'` as it assumes the `'` you want to match are in between letters/digits/`_`. – Wiktor Stribiżew Jul 21 '17 at 07:59
  • @WiktorStribiżew https://stackoverflow.com/questions/33344700/parse-a-large-string-between-single-quotes-with-escaping?rq=1 In this you have added a comment for Java version , the regex is " '[^']*(?:''[^']*)*' " Can we do the same in CPP regex boost?? – user6511542 Jul 21 '17 at 08:05
  • You do not have to use Boost for this pattern to work, `std::regex` will also do the job (`regex r("'[^']*(?:''[^']*)*'")`) – Wiktor Stribiżew Jul 21 '17 at 08:10
  • @user6511542 The regex you have mentioned works for php. How to make it work in cpp boost – user6511542 Jul 21 '17 at 08:12
  • @WiktorStribiżew And then I can do the post processing step by converting two single quotes to one single quote " ' ' " =>>>> " ' " ?? – user6511542 Jul 21 '17 at 08:16
  • :) Boost is a bit modified version of PCRE (used in PHP and there are implementation for a lot of other languages). The `'[^']*(?:''[^']*)*'` pattern will work the same across JS/Python/C#/Boost/PCRE/Java. Yes, then you would need to replace `''` with `'`. – Wiktor Stribiżew Jul 21 '17 at 08:18
  • @WiktorStribiżew Is there any way that the token which are extracted can be put into a string vector during the for loop? – user6511542 Jul 21 '17 at 08:32
  • Yes, sure, let me update the answer. BTW, do you need the outer `'` in the results? I mean do you need `'user'` or just `user`? – Wiktor Stribiżew Jul 21 '17 at 08:39
  • @WiktorStribiżew Yes i need the outer ' in the results i.e 'user' – user6511542 Jul 21 '17 at 08:46
  • @user6511542. Glad it worked for you. Please also consider upvoting if my answer proved helpful to you (see [How to upvote on Stack Overflow?](http://meta.stackexchange.com/questions/173399/how-to-upvote-on-stack-overflow)) since now you have 15 rep points and are entitled to upvoting. – Wiktor Stribiżew Jul 21 '17 at 08:54
  • @WiktorStribiżew In the demo you have given, I am not able to use boost::sregex_token_iterator(.. How can I do that. Thanks so much for the time. – user6511542 Jul 21 '17 at 10:25
  • Use `std::sregex_token_iterator`. Why use Boost at all here? Show your current code. Also, check [this answer](https://stackoverflow.com/questions/3122344/boost-c-regex-how-to-get-multiple-matches). – Wiktor Stribiżew Jul 21 '17 at 10:26
  • @WiktorStribiżew Yes, I want to do it the same way Jacob has commented in the link given, using boost. I want to use your regex only while extracting the tokens. ' boost::regex re_arg_name(" '[^']*(?:''[^']*)*' "); boost::sregex_token_iterator name_iter_start(argsStr.begin(), argsStr.end(), re_arg_values, 0),name_iter_end; typedef std::vector<std::string> StringVector; StringVector arg_values; std::copy(value_iter_start, value_iter_end, std::back_inserter(arg_values)); ' argsStr is the string : server ('m1.labs.\\''tera\"da ta.com') username ('us *(er'')5') password('uer 5') .... – user6511542 Jul 21 '17 at 10:41
  • @WiktorStribiżew Basically, if I am trying like that the following is the o/p i am getting : https://ideone.com/qYgk8T – user6511542 Jul 21 '17 at 10:56
  • See [this IDEONE demo](https://ideone.com/7UYgtB). – Wiktor Stribiżew Jul 21 '17 at 11:04
  • @WiktorStribiżew This is the output : the token is 'm1.labs.terada' the token is ') password(' There were 2 tokens found. Which is not expected. :( – user6511542 Jul 21 '17 at 11:10
  • Ok, are [these results](https://ideone.com/kMnovZ) expected? Look, I fear your data are corrupted, and that is a bottleneck for regexes. There might be no solution if you cannot define the **exact character context** for the expected matches. – Wiktor Stribiżew Jul 21 '17 at 11:27
  • @WiktorStribiżew The regex you had given in the example is working fine, but why not with boost? The string I am expecting is between (' '), and I want them in a string . Your regex is really great but I dont know what am I missing out – user6511542 Jul 21 '17 at 11:43

1 Answer

You are using a correct regex I shared yesterday via a comment. It matches single-quoted string literals that may have escaped single quotes inside.

std::regex re(R"('([^'\\]*(?:\\[\s\S][^'\\]*)*)')");
std::string s("server ('m1.labs.teradata.com') username ('u\\'se)r_*5') password('uer 5') dbname ('default')");
unsigned count = 0;
for (std::sregex_iterator i = std::sregex_iterator(s.begin(), s.end(), re);
     i != std::sregex_iterator();
     ++i)
{
    std::smatch m = *i;
    // Group 1 holds the token without the outer single quotes
    std::cout << "the token is" << "   " << m.str(1) << std::endl;
    count++;
}
std::cout << "There were " << count << " tokens found." << std::endl;

Here is my C++ demo

Note that the literal string ('u\'se)r_*5') must be written with a doubled backslash in a regular string literal, where escape sequences are processed and a literal backslash is spelled \\:

"('u\\'se)r_*5')"

or with a raw string literal where backslashes denote literal backslashes:

R"(('u\'se)r_*5'))"

The R"(...)" forms the raw string literal.

Pattern details:

  • ' - a single quote
  • [^'\\]* - 0+ chars other than single quote and backslash
  • (?:\\[\s\S][^'\\]*)* - zero or more sequences of:
    • \\[\s\S] - any backslash-escaped char
    • [^'\\]* - 0+ chars other than ' and \
  • ' - a single quote.

Note that to avoid matching the first single quote as an escaped quote you need to tweak the expression as in this snippet:

std::regex re(R"((?:^|[^\\])(?:\\{2})*'([^'\\]*(?:\\[\s\S][^'\\]*)*)')");
std::string s("server ('m1.labs.teradata.com') username ('u\\'se)r_*5') password('uer 5') dbname ('default')");
unsigned count = 0;
for (std::sregex_iterator i = std::sregex_iterator(s.begin(), s.end(), re);
     i != std::sregex_iterator();
     ++i)
{
    std::smatch m = *i;
    // The prefix is part of the whole match, so read the token from group 1
    std::cout << "the token is" << "   " << m.str(1) << std::endl;
    count++;
}
std::cout << "There were " << count << " tokens found." << std::endl;

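The (?:^|[^\\])(?:\\{2})* prefix matches the start of the string or any char other than \, followed by zero or more pairs of backslashes, so an escaped ' is never taken as the opening quote. Because the prefix is part of the overall match, the token itself has to be read from capture group 1.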

And finally, if you just need to get a list of matches into a vector, you may use

#include <iostream>
#include <string>
#include <vector>
#include <regex>

using namespace std;

int main() {
    std::regex rx("'[^']*(?:''[^']*)*'");
    std::string sentence("server ('m1.labs.\\''tera\"da  ta.com') username ('us *(er'')5') password('uer 5') dbname ('default')");
    std::vector<std::string> names(std::sregex_token_iterator(sentence.begin(), sentence.end(), rx),
                               std::sregex_token_iterator());

    for( auto & p : names ) cout << p << endl;
    return 0;
}

See the C++ demo.

Wiktor Stribiżew