
This is my string:

/*
  Block1 {

    anythinghere
  }
*/

// Block2 { }
# Block3 { }

Block4 {

    anything here
}

I am using this regex to get each block name and inside content.

std::regex e(R"~((\w+)\s+\{([^}]+)\})~", std::regex::optimize);

But this regex also matches the blocks inside descriptions (comments). PCRE, which PHP uses, has a "skip" idiom that lets you skip all descriptions:

What_I_want_to_avoid(*SKIP)(*FAIL)|What_I_want_to_match

But this is C++, and std::regex does not support this skip method. What should I do to skip all descriptions and get just Block4 with a C++ regex?

My regex matches Block1, Block2, Block3 and Block4, but I want to skip Block1, Block2 and Block3 (the ones inside descriptions) and match only Block4. How should I change my regex so that it matches only what is outside the descriptions?

BasicYard
  • It looks like you are trying to accomplish something with regular expressions that should be done with a parser. Having said that, it is not at all clear from your question what you actually want to match. – Tommy Andersen Feb 24 '16 at 17:23
  • What do you mean by "skip all descriptions"? Are you trying not to match comments? – Kyle A Feb 24 '16 at 17:27
  • yes try to not match comments – BasicYard Feb 24 '16 at 17:28
  • You can skip the comments, but a note that `; Block3 { hi }` is invalid C++ if nothing exists before the first comment. –  Feb 24 '16 at 17:29
  • Does your language allow nested comments? Do comment characters inside strings count as comments (e.g. "tets/*test*/")? Can the block name and the opening bracket be on separate lines? – Tommy Andersen Feb 24 '16 at 17:31
  • Actually you have to not only parse comments, but parse strings outside of comments as well, if for example your strings could span lines. –  Feb 24 '16 at 17:31
  • could you edit my regex with what you said? – BasicYard Feb 24 '16 at 17:32
  • Be careful. In c++, `;` does not start a comment and neither does `#`. The `;` ends statements and `#` signals the preprocessor (as in `#include` statements). Comments in c++ either start with `//` or are surrounded by `/*` and `*/`. – Kyle A Feb 24 '16 at 17:34
  • _You can_ parse the blocks, _but_ the part between `{ ...}` can be subtly masked within a string like `{ ..."}";....}`. The way you would have to do it is match comments or strings or your block as a single expression, i.e. `comments|strings|(block{(?:comments|strings|.)*})` then look for capture group 1 to match. I could give you the regex, but it's very big. –  Feb 24 '16 at 17:41
  • please give me the full regex i'm confounded – BasicYard Feb 24 '16 at 17:57

2 Answers


TL;DR: Regular expressions cannot be used to parse full-blown computer languages. What you want to do cannot be done reliably with regular expressions alone. You need to develop a mini C++ parser to filter out comments. The answer to this related question might point you in the right direction.

Regular expressions can process regular languages, but computer languages such as C++, PHP, Java, C#, HTML, etc. have a more complex syntax that includes a property named "middle recursion". Middle recursion includes complications such as an arbitrary number of matching parentheses, begin / end quotes, and comments that can contain symbols.

If you want to understand this in more detail, read the answers to this question about the difference between regular expressions and context free grammars. If you are really curious, enroll in a Formal Language Theory class.
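As a rough illustration of the "mini parser" idea, here is a minimal hand-written scanner sketch. It is not taken from any library: the function name `find_blocks` and the `Block` alias are hypothetical, and it assumes that `/* ... */`, `// ...` and `# ...` are the comment forms (as in the question's sample), that both `"` and `'` delimit strings with backslash escapes, and that a block is a word followed by a brace-delimited body.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <utility>
#include <vector>

using Block = std::pair<std::string, std::string>;  // (name, raw body)

// Hypothetical helper: collects every top-level `name { ... }` block that
// does not sit inside a comment or a quoted string.
std::vector<Block> find_blocks(const std::string& s) {
    std::vector<Block> blocks;
    std::string word;   // last identifier seen outside any block
    std::string name;   // name of the block currently open
    std::size_t body_start = 0, i = 0;
    const std::size_t n = s.size();
    int depth = 0;      // brace nesting level

    auto is_word = [](char ch) {
        return std::isalnum(static_cast<unsigned char>(ch)) || ch == '_';
    };

    while (i < n) {
        const char c = s[i];
        if (c == '/' && i + 1 < n && s[i + 1] == '*') {
            // Block comment: jump past the closing "*/" (or to the end).
            const std::size_t end = s.find("*/", i + 2);
            i = (end == std::string::npos) ? n : end + 2;
        } else if ((c == '/' && i + 1 < n && s[i + 1] == '/') || c == '#') {
            // Line comment (assumption: '#' also starts one).
            const std::size_t end = s.find('\n', i);
            i = (end == std::string::npos) ? n : end + 1;
        } else if (c == '"' || c == '\'') {
            // Quoted text: skip to the matching quote, honouring escapes.
            ++i;
            while (i < n && s[i] != c) i += (s[i] == '\\') ? 2 : 1;
            i = (i < n) ? i + 1 : n;
        } else if (depth == 0 && is_word(c)) {
            const std::size_t start = i;
            while (i < n && is_word(s[i])) ++i;
            word = s.substr(start, i - start);
        } else if (c == '{') {
            if (depth++ == 0) { name = word; word.clear(); body_start = i + 1; }
            ++i;
        } else if (c == '}') {
            if (depth > 0 && --depth == 0 && !name.empty())
                blocks.emplace_back(name, s.substr(body_start, i - body_start));
            ++i;
        } else {
            // Any other non-space character outside a block resets the name.
            if (depth == 0 && !std::isspace(static_cast<unsigned char>(c))) word.clear();
            ++i;
        }
    }
    assert(depth >= 0);  // the nesting counter never goes negative
    return blocks;
}
```

Calling `find_blocks` on the question's sample should yield a single entry named `Block4`. Because it counts braces, this sketch also copes with nested blocks, which a plain regex without recursion cannot do. Note that an apostrophe in free text inside a block would be read as the start of a string under the assumptions above.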

Jay Elston

Since you requested this long regex, here it is.

This will not handle nested blocks like block{ block{ } };
it would match only block{ block{ }, stopping at the first closing brace.

Since you specified you are using C++11 as the engine, I didn't use
recursion. This is easily changed to use recursion if you were to use
PCRE or Perl, or even Boost.Regex. Let me know if you'd want to see that.

As it is it's flawed, but works for your sample.
Another thing it won't do is parse preprocessor directives '#...' because
I forgot the rules for that (though I did it recently, I can't find a record).

To use it, sit in a while ( regex_search() ) loop looking for a match on
capture group 1, if (m[1].matched) etc. That will be your block.
The rest of the matches are for comments, quotes, or non-comments, unrelated
to the block. These have to be matched to advance the match position.

The code is long and redundant because there are no function calls (recursion) in the C++11 ECMAScript grammar. Like I said, use boost::regex or something.
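The loop described above can be sketched as follows. To keep it readable, this uses a much simplified pattern of the same shape (`comments|strings|(block)`), not the full regex from this answer: the block body is just `[^}]*`, so a `}` inside a string or comment within a block would end the match early. The function name `match_blocks` is hypothetical, and treating `#` as a line comment is an assumption taken from the question's sample.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns (name, body) for each block found outside
// comments and double-quoted strings.
std::vector<std::pair<std::string, std::string>>
match_blocks(const std::string& text) {
    // Alternatives are tried left to right at each position:
    //   1. /* ... */ comment      (consumed, no capture)
    //   2. // ... line comment    (consumed, no capture)
    //   3. # ... line comment     (assumption, consumed, no capture)
    //   4. "..." string           (consumed, no capture)
    //   5. name { body }          (captures 1 and 2)
    static const std::regex re(
        R"~(/\*[\s\S]*?\*/|//[^\n]*|#[^\n]*|"(?:\\.|[^"\\])*"|(\w+)\s*\{([^}]*)\})~",
        std::regex::ECMAScript | std::regex::optimize);

    std::vector<std::pair<std::string, std::string>> blocks;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        // Group 1 only participates when the block alternative matched;
        // every other match is a comment or string that was skipped over.
        if (m[1].matched) {
            blocks.emplace_back(m[1].str(), m[2].str());
        }
    }
    return blocks;
}
```

This is the usual way to emulate `(*SKIP)(*FAIL)` in engines that lack it: the unwanted constructs are matched first so they consume input, and only matches where the capture group participated are kept.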

Benchmark

Sample:

/*
  Block1 {

    anythinghere
  }
*/

// Block2 { }

Block4 {

   // CommentedBlock{ asdfasdf }
    anyth"}"ing here
}

Block5 {

   /* CommentedBlock{ asdfasdf }
    anyth}"ing here
   */
}

Results:

Regex1:   (?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(\w+\s*\{(?:(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\})[\S\s][^}/"'\\]*))*\})|[\S\s](?:(?!\w+\s*\{(?:(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\})[\S\s][^}/"'\\]*))*\})[^/"'\\])*)
Options:  < none >
Completed iterations:   50  /  50     ( x 1000 )
Matches found per iteration:   8
Elapsed Time:    1.95 s,   1947.26 ms,   1947261 µs

Regex Explained:

    # Raw:        (?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(\w+\s*\{(?:(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\})[\S\s][^}/"'\\]*))*\})|[\S\s](?:(?!\w+\s*\{(?:(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\})[\S\s][^}/"'\\]*))*\})[^/"'\\])*)
    # Stringed:  "(?:/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/|//(?:[^\\\\]|\\\\\\n?)*?\\n)|(?:\"[^\"\\\\]*(?:\\\\[\\S\\s][^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\[\\S\\s][^'\\\\]*)*'|(\\w+\\s*\\{(?:(?:/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/|//(?:[^\\\\]|\\\\\\n?)*?\\n)|(?:\"[^\"\\\\]*(?:\\\\[\\S\\s][^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\[\\S\\s][^'\\\\]*)*'|(?!\\})[\\S\\s][^}/\"'\\\\]*))*\\})|[\\S\\s](?:(?!\\w+\\s*\\{(?:(?:/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/|//(?:[^\\\\]|\\\\\\n?)*?\\n)|(?:\"[^\"\\\\]*(?:\\\\[\\S\\s][^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\[\\S\\s][^'\\\\]*)*'|(?!\\})[\\S\\s][^}/\"'\\\\]*))*\\})[^/\"'\\\\])*)"     


    (?:                              # Comments 
         /\*                              # Start /* .. */ comment
         [^*]* \*+
         (?: [^/*] [^*]* \*+ )*
         /                                # End /* .. */ comment
      |  
         //                               # Start // comment
         (?: [^\\] | \\ \n? )*?           # Possible line-continuation
         \n                               # End // comment
    )
 |                                 # OR,

    (?:                              # Non - comments 
         "
         [^"\\]*                          # Double quoted text
         (?: \\ [\S\s] [^"\\]* )*
         "
      |  '
         [^'\\]*                          # Single quoted text
         (?: \\ [\S\s] [^'\\]* )*
         ' 
      |  
         (                                # (1 start), BLOCK
              \w+ \s* \{               
              ####################
              (?:                              # ------------------------
                   (?:                              # Comments  inside a block
                        /\*                             
                        [^*]* \*+
                        (?: [^/*] [^*]* \*+ )*
                        /                                
                     |  
                        //                               
                        (?: [^\\] | \\ \n? )*?
                        \n                               
                   )
                |  
                   (?:                              # Non - comments inside a block
                        "
                        [^"\\]*                          
                        (?: \\ [\S\s] [^"\\]* )*
                        "
                     |  '
                        [^'\\]*                          
                        (?: \\ [\S\s] [^'\\]* )*
                        ' 
                     |  
                        (?! \} )
                        [\S\s]                          
                        [^}/"'\\]*                      
                   )
              )*                               # ------------------------
              #####################          
              \}                               
         )                                # (1 end), BLOCK

      |                                 # OR,

         [\S\s]                           # Any other char
         (?:                              # -------------------------
              (?!                              # ASSERT: Here, cannot be a BLOCK{ }
                   \w+ \s* \{                      
                   (?:                              # ==============================
                        (?:                              # Comments inside a block
                             /\*                              
                             [^*]* \*+
                             (?: [^/*] [^*]* \*+ )*
                             /                                
                          |  
                             //                               
                             (?: [^\\] | \\ \n? )*?
                             \n                               
                        )
                     |  
                        (?:                              # Non - comments inside a block
                             "
                             [^"\\]*                          
                             (?: \\ [\S\s] [^"\\]* )*
                             "
                          |  
                             '
                             [^'\\]*                          
                             (?: \\ [\S\s] [^'\\]* )*
                             ' 
                          |  
                             (?! \} )
                             [\S\s]                          
                             [^}/"'\\]*                       
                        )
                   )*                               # ==============================
                   \}                               
              )                                # ASSERT End

              [^/"'\\]                         # Char which doesn't start a comment, string, escape,
                                               # or line continuation (escape + newline)
         )*                               # -------------------------
    )                                # Done Non - comments