
I use various regexes to parse a C source file, line by line. First I read the whole content of the file into a string:

ifstream file_stream("commented.cpp",ifstream::binary);

std::string txt((std::istreambuf_iterator<char>(file_stream)),
std::istreambuf_iterator<char>());

Then I use a set of regexes, which should be applied one after another until a match is found. Here I give only one as an example:

vector<regex> rules = { regex("^//[^\n]*$") };

const char* search = txt.c_str();

int position = 0, length = 0;

for (int i = 0; i < rules.size(); i++) {
  cmatch match;

  if (regex_search(search + position, match, rules[i],regex_constants::match_not_bol | regex_constants::match_not_eol)) 
  {
     position += ( match.position() + match.length() );        
  }

}

But it doesn't work. It does not match the comment only in the current line; instead it searches the whole string for the first match. I expected regex_constants::match_not_bol and regex_constants::match_not_eol to make regex_search treat ^ and $ as the start/end of a line only, not the start/end of the whole block. So here is my file:

commented.cpp:

#include <stdio.h>
//comment

The match should fail. My reasoning is that with those options passed to regex_search, it should search for the pattern only in the first line:

#include <stdio.h>

But instead it searches the whole string and immediately finds //comment. I need help making regex_search match only in the current line. The options match_not_bol and match_not_eol do not help. Of course I can read the file line by line into a vector and then match all the rules against each string in the vector, but that is very slow: I have done it, and it takes too long to parse a big file. That's why I want to let the regex deal with newlines and use a position counter.

  • Why not read the file into a vector of strings? Then applying a regex to the current line will be easy. – Artemy Vysotsky Sep 07 '17 at 04:55
  • @ArtemyVysotsky I did that, and it works very, very slowly: 2 minutes to process a file of 3000 lines of C code – YakibutaRamen Sep 07 '17 at 04:57
  • So you have one version of the code that runs fast but does not do the job, and another version that does what you want but runs slowly? I recommend asking another question: show the properly working version of your code and ask how to make it faster. – Artemy Vysotsky Sep 07 '17 at 05:04
  • You might be seeing the effects of CRLF (two chars) vs. LF (standard Unix), since you are opening the file in binary mode. – doug Sep 07 '17 at 06:15
  • Would you accept a suggestion about the `std::regex` library? I have no code for you, but I can explain what is going on. – Shakiba Moshiri Sep 07 '17 at 13:07
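The two suggestions in the comments above (work on one line at a time, and watch for CRLF when the file is opened in binary mode) can be combined without copying each line into its own string. The following is only a minimal sketch under stated assumptions: `comment_lines` is a hypothetical helper name, and it uses `std::regex_match` on an iterator range in place of the `^...$` anchors from the question, so that the match is confined to the current line by construction.

```cpp
#include <algorithm>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the 0-based numbers of the lines in `txt`
// that consist entirely of a // comment.
std::vector<std::size_t> comment_lines(const std::string& txt) {
    // Compiled once. regex_match must consume the whole range it is given,
    // so no ^ or $ anchors are needed.
    static const std::regex comment("//.*");

    std::vector<std::size_t> hits;
    std::size_t line_no = 0;
    auto line_begin = txt.cbegin();

    while (line_begin != txt.cend()) {
        // [line_begin, newline) is the current line, without the '\n'.
        auto newline = std::find(line_begin, txt.cend(), '\n');
        auto line_end = newline;
        // Drop a trailing '\r' so CRLF files behave like LF files.
        if (line_end != line_begin && *(line_end - 1) == '\r') --line_end;

        if (std::regex_match(line_begin, line_end, comment)) hits.push_back(line_no);

        if (newline == txt.cend()) break;  // last line had no trailing '\n'
        line_begin = newline + 1;
        ++line_no;
    }
    return hits;
}
```

For the two-line `commented.cpp` from the question, the first line is rejected and only the second line is reported, because each call sees nothing beyond the current line.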

2 Answers


If this is not what you want, please comment and I will delete the answer.

What you are doing is not the correct way to use a regex library.
So here is my advice for anyone who wants to use the std::regex library.

  1. Of the modern regex dialects it only supports (a modified) ECMAScript, which is somewhat poorer than what other modern regex libraries offer.
  2. It has plenty of bugs (these are just the ones I found):

    1. the same regex but different results on Linux and Windows only C++
    2. std::regex and ignoring flags
    3. std::regex_match and lazy quantifier with strange behavior
  3. In some cases (I tested specifically with std::match_results) it is 200 times slower than std.regex in the D language.
  4. Its match flags are very confusing and mostly do not work (at least for me).

Conclusion: do not use it at all.
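On the match flags from point 4: as far as the standard describes them, `match_not_bol` and `match_not_eol` do not restrict a search to one line. They only say that the first and last positions of the searched range must not be treated as a line start or line end. A minimal sketch follows, where `found` is a hypothetical helper name. Whether `^` and `$` also match next to embedded newlines differs between standard library implementations, so the sketch deliberately uses input without newlines.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `pattern` is found anywhere in `text`
// when searching with the given match flags.
bool found(const std::string& text, const std::string& pattern,
           std::regex_constants::match_flag_type flags =
               std::regex_constants::match_default) {
    return std::regex_search(text, std::regex(pattern), flags);
}

// Expected behaviour on a single-line input "abc":
//   found("abc", "^a")                                    -> true
//   found("abc", "^a", std::regex_constants::match_not_bol) -> false
//   found("abc", "c$")                                    -> true
//   found("abc", "c$", std::regex_constants::match_not_eol) -> false
```

So with both flags set, a pattern of the form `^...$` cannot match at the very start or very end of the buffer, but the search still scans the whole buffer.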


But if anyone still insists on using it anyway, then you can:

  1. Use boost::regex from the Boost library, because:

    1. It supports Perl-compatible (PCRE-style) syntax
    2. It has fewer bugs (I have not seen any)
    3. It produces a smaller binary (I mean the executable file after compiling)
    4. It is faster than std::regex
  2. Use gcc version 7.1.0 or later, NOT earlier. The last bug I found is in version 6.3.0.
  3. Use clang version 3 or above.

If you have been persuaded NOT to use it, then you can:

  1. Use the D regular expression library, std.regex, for large tasks. Why:

    1. Fast: see "Faster Command Line Tools in D"
    2. Easy
    3. Flexible drn
  2. Use the native pcre or pcre2 library, which is written in C:

    • Extremely fast but a little complicated
  3. Use Perl for simple tasks, especially Perl one-liners
Shakiba Moshiri

#include <stdio.h> //comment

The match should fail. My reasoning is that with those options passed to regex_search, it should search for the pattern only in the first line:

#include <stdio.h>

But instead it searches the whole string and immediately finds //comment. I need help making regex_search match only in the current line.

Are you trying to match all // comments in a source code file, or only the first line?

The former can be done like this:

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <regex>
#include <string>

int main()
{
  auto input = std::ifstream{"stream_union.h"};
  // Compile the pattern once, outside the loop; constructing a std::regex is expensive.
  const auto pattern = std::regex(R"(//)");

  for(auto line = std::string{}; std::getline(input, line); )
  {
    auto submatch = std::smatch{};
    // Skip lines that contain no // comment.
    if(!std::regex_search(line, submatch, pattern)) continue;

    std::cout << line << std::endl;
  }
  std::cout << std::endl;

  return EXIT_SUCCESS;
}

And the latter can be done like this:

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <regex>
#include <string>

int main()
{
  auto input = std::ifstream{"stream_union.h"};
  auto line = std::string{};
  // Read only the first line of the file.
  std::getline(input, line);

  auto submatch = std::smatch{};
  const auto pattern = std::regex(R"(//)");

  // Fail if the first line contains no // comment.
  if(!std::regex_search(line, submatch, pattern)) { return EXIT_FAILURE; }

  std::cout << line << std::endl;

  return EXIT_SUCCESS;
}

If for any reason you need the position of the match within the line, submatch.position(0) gives it; input.tellg() gives the current offset in the file stream.
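For the offset of a match inside a line, `std::smatch::position` is one option. A minimal sketch, where `comment_column` is a hypothetical helper name and -1 is an arbitrary "not found" value chosen for illustration:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the 0-based column of the first "//" in
// `line`, or -1 if the line contains none.
long comment_column(const std::string& line) {
    static const std::regex pattern("//");  // compiled once
    std::smatch m;
    if (!std::regex_search(line, m, pattern)) return -1;
    // position(0) is the offset of the whole match from the start of `line`.
    return static_cast<long>(m.position(0));
}
```

Adding this column to the stream offset recorded before reading the line would give an offset into the file, if that is what is needed.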

Danielle
  • I'm trying to match the whole source file, to divide it into logical tokens. That's why I need to process it line by line; I included only the regex for comments to explain the problem. In reality I have a set of regexes, each designed to match an include/identifier/operator of C source code. I need to run all of those regexes on each line to tokenize the code. What I am doing is the work a compiler does when it parses a source file. – YakibutaRamen Sep 09 '17 at 10:58
  • You can't parse C with regex. You can tokenize individual words, but you can't parse most of the language without more tools than just regex. It should be plenty fast to run a dozen regexes on a single line of code at a time. The example I wrote above applies the regex to a single line, not to the entire source code at once. This should speed things up quite a bit. And if for any reason you need it even faster, re2 is the fastest C++ regex library (that I know of), but for basic tokenization you shouldn't notice a difference. It is the size of the input to the regex that determines the speed. – Danielle Sep 09 '17 at 22:40
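The approach discussed in these comments (a small set of regexes tried in turn on one line, advancing a position counter) can be sketched as below. This is only an illustration under stated assumptions: `Token`, `Rule`, `tokenize_line` and the five rules are hypothetical and far from a complete C lexer. The rules are compiled once, and `std::regex_constants::match_continuous` makes each search succeed only if the match starts exactly at the current position.

```cpp
#include <regex>
#include <string>
#include <vector>

struct Token { std::string kind; std::string text; };  // illustrative type
struct Rule  { const char* kind; std::regex re; };     // illustrative type

// Hypothetical helper: splits one line (without its '\n') into tokens.
std::vector<Token> tokenize_line(const std::string& line) {
    // Compiled once; tried in this order at every position.
    static const std::vector<Rule> rules = {
        {"space",   std::regex(R"(\s+)")},
        {"comment", std::regex(R"(//.*)")},
        {"ident",   std::regex(R"([A-Za-z_]\w*)")},
        {"number",  std::regex(R"(\d+)")},
        {"other",   std::regex(R"(.)")},  // any single remaining character
    };

    std::vector<Token> out;
    auto pos = line.cbegin();
    while (pos != line.cend()) {
        bool matched = false;
        for (const Rule& rule : rules) {
            std::smatch m;
            // match_continuous: the match must begin exactly at `pos`.
            if (std::regex_search(pos, line.cend(), m, rule.re,
                                  std::regex_constants::match_continuous)) {
                if (std::string(rule.kind) != "space")
                    out.push_back({rule.kind, m.str(0)});
                pos += m.length(0);  // advance the position counter
                matched = true;
                break;
            }
        }
        if (!matched) ++pos;  // no rule applies: skip one character
    }
    return out;
}
```

For example, `tokenize_line("int x = 42; // answer")` yields six tokens, the last being the comment `// answer`. Since every rule either matches at `pos` or fails immediately, the cost per line stays proportional to the line length times the number of rules.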