Hello, this is my string in C++:

data1(" value 1 ") data2 ("value 2") anything3("  data3("value") ")

and this is my regex:

regex Rgx(R"~((\w+)\s*[(]\s*["]([^"]*)["]\s*[)])~");

I want to use C++ regex_search and get

data1
data2
anything3

and

 value 1 
value 2
  data3("value") 

but my result is:

data1(" value 1 ")
data1
 value 1

data2 ("value 2")
data2
value 2

data3("value")
data3
value
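
For reference, the result above can be reproduced with a loop along the following lines. This is only a sketch: the question shows the regex but not the search loop, so the helper name collectMatches and the way the search position is advanced are assumptions.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not from the question): runs regex_search repeatedly
// and collects the (name, value) capture groups of every match.
std::vector<std::pair<std::string, std::string>> collectMatches(const std::string& text) {
    // The regex exactly as posted in the question.
    static const std::regex rgx(R"~((\w+)\s*[(]\s*["]([^"]*)["]\s*[)])~");

    std::vector<std::pair<std::string, std::string>> result;
    std::smatch m;
    auto first = text.cbegin();

    while (std::regex_search(first, text.cend(), m, rgx)) {
        result.emplace_back(m[1].str(), m[2].str());
        first = m[0].second;  // continue searching right after this match
    }
    return result;
}
```

Because [^"]* cannot cross a quote, the attempt that starts at anything3 fails as soon as it reaches the inner quote, and the search then finds data3("value") on its own.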

What is the problem? I want this regex to match only text in this style that is outside of "".

I want to change this regex so that it matches data3("value") only after checking that data3("value") is not between "".

In PCRE, I have seen this regex, which can skip over any "" nested inside "":

\h*(.*?)\h*[(]\h*"(.*?(?:[\\\\]".*?)*)"[)]\h*([,|.*?])

This works for

key1("val1"), key2("val2"), key3(" key4("val3") ret "),

and you can check the result:

0 => array(

    0=>key1("val1"),
    1=> key2("val2"),
    2=> key3(" key4("val3") ret "),
)

1 => array(

    0=>key1
    1=>key2
    2=>key3
)

2 => array(

    0=>val1
    1=>val2
    2=> key4("val3") ret
)

I need something like this to skip all "" nested inside "".

Elh48
  • Actually the only snippet that parses as C++ is the one with the regex, and even there I'm not sure. Can you provide a complete (but still minimal) example? – Ulrich Eckhardt Mar 31 '16 at 19:45
  • This is the result of each regex_search call in a while loop. – Elh48 Mar 31 '16 at 19:49
  • 1) `std::regex` does not support recursion, Boost does, 2) If there is only one possible nested level, you can use `std::regex` with [`R"~((\w+)\s*\(\s*"([^"]*(?:\s*\w+\("[^"]*"\)\s*)*)"\s*\))~"`](https://regex101.com/r/nE7fK5/1) regex. – Wiktor Stribiżew Mar 31 '16 at 19:54
  • What engine are you using? There is a balanced text element in your string that requires some sort of recursion. Use boost::regex. –  Mar 31 '16 at 19:54
  • Don't try parsing non-regular languages with regular expressions. – 5gon12eder Mar 31 '16 at 20:06
  • My question is simple: I just need to know how I can limit my regex to match only what is between " " – Elh48 Mar 31 '16 at 20:15
  • Not really that easy, what you are trying to do will match `anything3(" data3("value")` If you say `[^"]` you will not be able to get `anything3(" data3("value") ")` which contains a _nested_ set of delimiters. –  Mar 31 '16 at 20:20
  • Wiktor Stribiżew, your regex did not work – Elh48 Mar 31 '16 at 20:27
  • @5gon12eder I'd say this is a pretty regular language. If C++ used a regex syntax that supported recursion, I'd say this was an excellent candidate. As it is we'll have to go to Boost for that or parse by hand. – Jonathan Mee Apr 01 '16 at 13:46
  • @JonathanMee Maybe I didn't understand the requirements correctly but parsing arbitrarily nested parenthesis and quotes is certainly not regular. I'm also not sure what a “pretty regular language” is. But indeed, most modern regex engines allow parsing more languages than just strictly regular ones. – 5gon12eder Apr 01 '16 at 17:20
  • @5gon12eder https://en.wikipedia.org/wiki/Recursive_language: "All regular, context-free and context-sensitive languages are recursive." And yes, I do view http://en.wikipedia.org as the ultimate source of truth. – Jonathan Mee Apr 01 '16 at 18:07
  • @JonathanMee I don't see your point. While it is true that any regular language is recursive, this doesn't mean that any recursive language is also regular. Quite clearly, it is not. – 5gon12eder Apr 01 '16 at 18:26
  • @5gon12eder I'm trying to point out that regular languages *are* recursive. The problem is that ECMAScript is not currently capable of modeling regular languages. But many regular expressions are, PCRE, for instance. The point being the OP is *not*, "parsing non-regular languages." – Jonathan Mee Apr 01 '16 at 18:35
  • @JonathanMee As I've said, I might not have understood the requirements correctly. But any language involving correctly nested parenthesis (at any depth) is certainly not regular. It might be context-free, though. – 5gon12eder Apr 01 '16 at 18:38
  • @5gon12eder "Any language involving correctly nested parenthesis (at any depth) is certainly not regular" Do you have a source for that? I'd contend that it is regular, hence my http://en.wikipedia.org quote. – Jonathan Mee Apr 01 '16 at 18:41
  • @JonathanMee It is known as the [Dyck language](https://en.wikipedia.org/wiki/Dyck_language) and is a classical teaching example for a non-regular but context-free language. (Unfortunately, its Wikipedia article is a bit less than useful.) You can prove its non-regularity for example by using the [pumping lemma](https://en.wikipedia.org/wiki/Pumping_lemma_for_regular_languages). There is a sketch of an alternative proof [here](https://en.wikipedia.org/wiki/Regular_language#The_number_of_words_in_a_regular_language). – 5gon12eder Apr 01 '16 at 18:57

1 Answer

What you're looking for is regex recursion. That's not supported by C++'s regex engine (ECMAScript). So if you're going to parse a string that has recursion in C++, you'll either need Boost or you'll have to do it by hand.

Since I'd always encourage using the standard language facilities where possible, I'll show you how to do this without Boost.

We'll need two functions. The first finds a non-escaped character:

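template <typename T>
T findNonEscaped(T start, T end, const char ch) {
    T result = find(start, end, ch);

    // Skip any match preceded by a backslash, resuming the search just past it
    while (result != end && result != start && result[-1] == '\\') result = find(next(result), end, ch);
    return result;
}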
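
As a quick illustration of what "non-escaped" means here, the following self-contained sketch wraps the same idea in a helper that returns an index. The helper indexOfNonEscaped is hypothetical and only there to make the behavior easy to check; the retry search resumes just past each escaped hit.

```cpp
#include <algorithm>
#include <iterator>
#include <string>

// Returns an iterator to the first ch in [start, end) that is not
// immediately preceded by a backslash, or end if there is none.
template <typename T>
T findNonEscaped(T start, T end, const char ch) {
    T result = std::find(start, end, ch);

    // Skip hits preceded by a backslash, resuming just past each one.
    while (result != end && result != start && *std::prev(result) == '\\')
        result = std::find(std::next(result), end, ch);
    return result;
}

// Hypothetical helper for illustration: index of the first non-escaped ch,
// or std::string::npos when there is none.
std::string::size_type indexOfNonEscaped(const std::string& s, char ch) {
    const auto it = findNonEscaped(s.cbegin(), s.cend(), ch);
    return it == s.cend() ? std::string::npos
                          : static_cast<std::string::size_type>(it - s.cbegin());
}
```

For the text a\(b(c, the escaped parenthesis at index 2 is skipped and the one at index 4 is reported.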

The second finds the closing parenthesis that matches an opening one, skipping over any nested parentheses:

template <typename T>
T extractParenthesis(T start, T end) {
    T finish = findNonEscaped(start, end, ')');

    for (auto i = findNonEscaped(next(start), end, '('); i != end && i < finish; i = findNonEscaped(next(i), end, '(')) finish = findNonEscaped(next(finish), end, ')');
    return finish;
}
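
If the iterator version is hard to follow, the same matching can be expressed with an explicit depth counter. This is a different formulation, not the code above: the name matchingParen is hypothetical, it works on indices rather than iterators, and it assumes the given index really points at an opening parenthesis.

```cpp
#include <string>

// Hypothetical alternative: given the index of a '(' in s, return the index
// of its matching ')', or std::string::npos if the parentheses are unbalanced.
std::string::size_type matchingParen(const std::string& s, std::string::size_type open) {
    int depth = 0;

    for (auto i = open; i < s.size(); ++i) {
        if (s[i] == '\\') { ++i; continue; }  // skip the escaped character
        if (s[i] == '(') ++depth;             // one level deeper
        else if (s[i] == ')' && --depth == 0) return i;  // back at the outer level
    }
    return std::string::npos;
}
```

Like extractParenthesis, this counts parentheses only and pays no attention to quotes.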

Finally, given the input line:

const auto input = "data1(\" value 1 \") data2 (\"value 2\") anything3(\" data3(\"value\") \")"s;

we can use those two functions to write this:

map<string, string> output;

for (auto openParenthesis = findNonEscaped(input.cbegin(), input.cend(), '('), closeParenthesis = input.cbegin(); openParenthesis != input.cend(); openParenthesis = findNonEscaped(openParenthesis, input.cend(), '(')) {
    decltype(output)::key_type key;
    istringstream ss{ string{ make_reverse_iterator(openParenthesis), make_reverse_iterator(closeParenthesis) } };

    ss >> key;
    closeParenthesis = extractParenthesis(openParenthesis, input.cend());
    output[decltype(output)::key_type{ key.crbegin(), key.crend() }] = decltype(output)::mapped_type{ next(findNonEscaped(next(openParenthesis), closeParenthesis, '"')), prev(findNonEscaped(make_reverse_iterator(closeParenthesis), make_reverse_iterator(next(openParenthesis)), '"').base()) };
    openParenthesis = closeParenthesis;
}
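
Put together as one compilable unit, the pieces might look like the sketch below. The wrapper name parsePairs, the std:: qualification, and the extra guards against unbalanced or unquoted input are additions for illustration rather than part of the snippets above.

```cpp
#include <algorithm>
#include <iterator>
#include <map>
#include <sstream>
#include <string>

template <typename T>
T findNonEscaped(T start, T end, const char ch) {
    T result = std::find(start, end, ch);
    // Skip hits preceded by a backslash, resuming just past each one.
    while (result != end && result != start && *std::prev(result) == '\\')
        result = std::find(std::next(result), end, ch);
    return result;
}

template <typename T>
T extractParenthesis(T start, T end) {
    T finish = findNonEscaped(start, end, ')');
    // Every '(' found before the current ')' pushes the match one ')' further.
    for (auto i = findNonEscaped(std::next(start), end, '(');
         i != end && finish != end && i < finish;
         i = findNonEscaped(std::next(i), end, '('))
        finish = findNonEscaped(std::next(finish), end, ')');
    return finish;
}

// Hypothetical wrapper: maps each key to the text between its outermost quotes.
std::map<std::string, std::string> parsePairs(const std::string& input) {
    std::map<std::string, std::string> output;
    auto closeParenthesis = input.cbegin();

    for (auto openParenthesis = findNonEscaped(input.cbegin(), input.cend(), '(');
         openParenthesis != input.cend();
         openParenthesis = findNonEscaped(openParenthesis, input.cend(), '(')) {
        // Read the key from the reversed text before '(' so that whitespace
        // between the key and '(' is skipped, then reverse it back.
        std::istringstream ss{ std::string(std::make_reverse_iterator(openParenthesis),
                                           std::make_reverse_iterator(closeParenthesis)) };
        std::string key;
        ss >> key;

        closeParenthesis = extractParenthesis(openParenthesis, input.cend());
        if (closeParenthesis == input.cend()) break;  // unbalanced input: stop

        const auto firstQuote = findNonEscaped(std::next(openParenthesis), closeParenthesis, '"');
        const auto lastQuote = std::prev(
            findNonEscaped(std::make_reverse_iterator(closeParenthesis),
                           std::make_reverse_iterator(std::next(openParenthesis)), '"').base());

        // Store only when an opening and a closing quote were both found.
        if (firstQuote < lastQuote)
            output[std::string(key.crbegin(), key.crend())] =
                std::string(std::next(firstQuote), lastQuote);
        openParenthesis = closeParenthesis;
    }
    return output;
}
```

Calling parsePairs on the input line above yields three entries: data1 maps to " value 1 ", data2 to "value 2", and anything3 to " data3("value") ", with the nested quotes preserved.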

This code is pretty resilient. The only defect I know of is that for an invalid input like

const auto input = "key1(\"value1\"\"value2\")"

it will return:

key1 : value1""value2

I know some of this iterator functionality is a bit more... advanced. So if you have specific questions feel free to let me know in the comments.

Jonathan Mee