std regex_search to match only current line

Question

I use various regexes to parse a C source file, line by line. First I read the entire content of the file into a string:

ifstream file_stream("commented.cpp",ifstream::binary);

std::string txt((std::istreambuf_iterator<char>(file_stream)),
std::istreambuf_iterator<char>());

Then I use a set of regexes, which should be applied in turn until a match is found. Here I give only one as an example:

vector<regex> rules = { regex("^//[^\n]*$") };

char * search =(char*)txt.c_str();

int position = 0, length = 0;

for (int i = 0; i < rules.size(); i++) {
  cmatch match;

  if (regex_search(search + position, match, rules[i],regex_constants::match_not_bol | regex_constants::match_not_eol)) 
  {
     position += ( match.position() + match.length() );        
  }

}

But it doesn't work. It matches the comment even though it is not on the current line: it searches the whole string for the first match. I expected regex_constants::match_not_bol and regex_constants::match_not_eol to make regex_search treat ^ and $ as the start/end of a line only, not the start/end of the whole block. Here is my file:

commented.cpp:

#include <stdio.h>
//comment

The code should fail. My reasoning is that, with those options passed to regex_search, the match should fail, because it should search for the pattern only in the first line:

#include <stdio.h>

But instead it searches the whole string and immediately finds //comment. I need help making regex_search match only in the current line. The options match_not_bol and match_not_eol do not help. Of course I can read the file line by line into a vector and then apply all the rules to each string in the vector, but that is very slow. I have done that, and it takes too long to parse a big file, which is why I want to let the regex deal with newlines and use a position counter.

Why not read the file into a vector of strings? Then applying a regex to the current line will be easy. — Artemy Vysotsky, Sep 07 '17 at 04:55
@ArtemyVysotsky I did that, and it works very, very slowly: 2 minutes to process a file of 3000 lines of C code — YakibutaRamen, Sep 07 '17 at 04:57
So you have one version of the code that runs fast but does not do the job, and another version that does what you want but runs slowly? I recommend asking another question: show the properly working version of your code and ask how to make it faster. — Artemy Vysotsky, Sep 07 '17 at 05:04
You might be seeing the effects of CRLF (two chars) v LF (standard Unix) since you are opening the file in binary mode. — doug, Sep 07 '17 at 06:15
Can you accept a suggestion about the `std::regex` library? I have no code for you, but I can explain what is going on. — Shakiba Moshiri, Sep 07 '17 at 13:07

Shakiba Moshiri · Answer 1 · 2017-09-07T14:16:32.367

If this is not what you want, please comment and I will delete the answer.

What you are doing is not the correct way to use a regex library.
So here is my suggestion for anyone who wants to use the std::regex library:

- Its most capable grammar is ECMAScript, which is somewhat more limited than what most modern regex libraries offer.
- It has plenty of bugs (counting only the ones I found).
- In some cases (I tested specifically with std::match_results) it is 200 times slower than std.regex in the D language.
- Its match flags are very confusing and mostly do not work (at least for me).

Conclusion: do not use it at all.

But if anyone still needs to use C++, then you can:

- use boost::regex from the Boost library, because:
  1. It supports PCRE syntax
  2. It has fewer bugs (I have not seen any)
  3. It produces a smaller binary (I mean the executable file after compiling)
  4. It is faster than std::regex
- use gcc version 7.1.0 or later, NOT earlier. The last bug I found was in version 6.3.0
- use clang version 3 or above

If you have been persuaded NOT to use C++, then you can:

- Use the D regular expression library, std.regex, for large tasks. Why:
  1. Fast (see "Faster Command Line Tools in D")
  2. Easy
  3. Flexible
- Use native pcre or pcre2, which are written in C
  - Extremely fast but a little complicated
- Use Perl for simple tasks, especially Perl one-liners
I will try boost and reply to you — YakibutaRamen, Sep 09 '17 at 10:56

Danielle · Answer 2 · 2017-09-07T10:42:12.273

#include <stdio.h>
//comment

The code should fail. My reasoning is that, with those options passed to regex_search, the match should fail, because it should search for the pattern only in the first line:

#include <stdio.h>

But instead it searches the whole string and immediately finds //comment. I need help making regex_search match only in the current line.

Are you trying to match all // comments in a source code file, or only the first line?

The former can be done like this:

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <regex>
#include <string>

int main()
{
  auto input = std::ifstream{"stream_union.h"};
  // Compile the pattern once, outside the loop.
  const auto pattern = std::regex(R"(//)");

  for(auto line = std::string{}; std::getline(input, line); )
  {
    auto submatch = std::smatch{};
    // Skip lines that do not contain a // comment.
    if(!std::regex_search(line, submatch, pattern)) continue;

    std::cout << line << std::endl;
  }
  std::cout << std::endl;

  return EXIT_SUCCESS;
}

And the latter can be done like this:

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <regex>
#include <string>

int main()
{
  auto input = std::ifstream{"stream_union.h"};
  auto line = std::string{};
  // Read only the first line of the file.
  std::getline(input, line);

  auto submatch = std::smatch{};
  const auto pattern = std::regex(R"(//)");

  // Fail if the first line does not contain a // comment.
  if(!std::regex_search(line, submatch, pattern)) { return EXIT_FAILURE; }

  std::cout << line << std::endl;

  return EXIT_SUCCESS;
}

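If for any reason you need the position of the match, submatch.position(0) gives its offset within the line, and input.tellg() reports the stream's current read position in the file (just past the line that was read).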
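For what it's worth, the search can also be confined to one line without copying that line out of the buffer. regex_search has an overload that takes an iterator pair, so if only the range [line start, line end) of the big string is passed, ^ and $ cannot see anything outside that line. Below is a minimal sketch, assuming the whole file is already in a std::string as in the question. matching_lines is a hypothetical helper name chosen for illustration, and stripping a trailing '\r' is an assumption to cope with CRLF files opened in binary mode.

```cpp
#include <algorithm>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the 0-based numbers of the lines on which
// `rule` matches. The text is scanned in place; no per-line copies are made.
std::vector<std::size_t> matching_lines(const std::string& text,
                                        const std::regex& rule)
{
    std::vector<std::size_t> hits;
    std::size_t line_no = 0;
    auto line_begin = text.cbegin();

    while (line_begin != text.cend()) {
        // The current line ends at the next '\n' or at the end of the text.
        auto line_end = std::find(line_begin, text.cend(), '\n');

        // Assumption: drop a trailing '\r' so CRLF files behave like LF files.
        auto content_end = line_end;
        if (content_end != line_begin && *(content_end - 1) == '\r') {
            --content_end;
        }

        // Only [line_begin, content_end) is visible to the regex engine.
        std::smatch match;
        if (std::regex_search(line_begin, content_end, match, rule)) {
            hits.push_back(line_no);
        }

        if (line_end == text.cend()) break;  // last line had no newline
        line_begin = line_end + 1;           // step past the '\n'
        ++line_no;
    }
    return hits;
}
```

With the question's rule regex("^//[^\n]*$") and a buffer holding the two-line commented.cpp, only the second line (index 1) is reported, and the first line fails to match as the asker expected.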

I'm trying to match the whole source file, to divide it into logical tokens. That's why I need to process it line by line; I included only the regex for comments to explain the problem. But really I have a set of regexes, each designed to match an include/identifier/operator of the C source code. I need to run all those regexes on each line to tokenize the code. What I am doing is the work a compiler does when it parses a source file. — YakibutaRamen, Sep 09 '17 at 10:58
You can't parse C with regex. You can tokenize individual words, but you can't parse most of the language without more tools than just regex. It should be plenty fast to run a dozen regexes on a single line of code at a time. The example I wrote above applies the regex to a single line, not to the entire source code at once. This should speed things up quite a bit. And if for any reason you need it even faster, re2 is the fastest C++ regex library (that I know of), but for basic tokenization you shouldn't notice a difference. It is the size of the input to the regex that determines the speed. — Danielle, Sep 09 '17 at 22:40
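As a rough illustration of running a set of rules over a single line with a position counter, here is a sketch that tries each rule at the current position and advances past whatever matched. It relies on the standard flag regex_constants::match_continuous, which only accepts a match that starts exactly at the first iterator. The Token struct, the tokenize_line name, and the policy of skipping unmatched characters one at a time are assumptions made for the example, not something taken from the question.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical token record: index of the rule that matched, plus the text.
struct Token {
    std::size_t rule;
    std::string text;
};

// Tokenizes one line by trying every rule at the current position.
// Rules are tried in order; the first one that matches wins.
std::vector<Token> tokenize_line(const std::string& line,
                                 const std::vector<std::regex>& rules)
{
    std::vector<Token> tokens;
    auto pos = line.cbegin();

    while (pos != line.cend()) {
        bool matched = false;
        for (std::size_t i = 0; i < rules.size(); ++i) {
            std::smatch m;
            // match_continuous: the match must begin exactly at `pos`.
            if (std::regex_search(pos, line.cend(), m, rules[i],
                                  std::regex_constants::match_continuous)
                && m.length(0) > 0) {
                tokens.push_back({i, m.str(0)});
                pos = m[0].second;  // continue right after the match
                matched = true;
                break;
            }
        }
        // Assumption: characters that no rule recognizes are skipped.
        if (!matched) ++pos;
    }
    return tokens;
}
```

Because every search starts at pos, the rules in this sketch should not use ^: the start of the passed range would count as the beginning of input each time. For the line "int x = 42; // answer" with rules for comments, identifiers, and numbers (in that order), the result is four tokens: int, x, 42, and the trailing comment.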
