
I used regex_token_iterator<> to get all matched substrings in a line, as suggested in this question. But the code sometimes misses the second matched substring in a line, and the lines where this happens change from run to run. Is this a bug in regex_token_iterator<>, or is something wrong in my code? The compiler is Apple clang version 14.0.0 (clang-1400.0.29.202), and I compiled the following code with -std=c++14.

I also tried the other suggestion from the question above, which is to apply regex_search() repeatedly in a while loop, and that version worked properly. I just want to know why the version with regex_token_iterator<> does not work, and whether my usage is wrong.
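For reference, a regex_search() loop of this kind could look roughly like the following minimal sketch. The helper name convert_dates_with_search is hypothetical, and this is an illustration of the approach rather than the exact code that was run. It searches forward from the end of the previous match and writes the result to a separate stream, so the input string is never modified while it is being scanned.

```cpp
#include <iomanip>
#include <regex>
#include <sstream>
#include <string>

// Hypothetical helper (not from the original post): rewrites every
// m/d/yyyy date in `line` as yyyy-mm-dd using repeated regex_search().
std::string convert_dates_with_search(const std::string& line) {
  static const std::regex pat{R"((\d{1,2})/(\d{1,2})/(\d{4}))"};

  std::ostringstream out;
  out << std::setfill('0');  // pad one-digit months and days with '0'

  std::smatch m;
  auto pos = line.cbegin();
  while (std::regex_search(pos, line.cend(), m, pat)) {
    out << m.prefix().str();  // text between the previous match and this one
    out << m.str(3) << '-' << std::setw(2) << m.str(1) << '-'
        << std::setw(2) << m.str(2);
    pos = m[0].second;  // continue right after the current match
  }
  out << std::string(pos, line.cend());  // text after the last match
  return out.str();
}
```

Called on one of the input lines above, for example convert_dates_with_search("12/01/2022 - 12/31/2022"), the function returns a new string and leaves its argument untouched.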

code:

#include<regex>
#include<iostream>
#include<string>
#include<fstream>
#include<sstream>

using namespace std;

struct bad_from_string : bad_cast{
  const char* what() const noexcept override{
    return "bad cast from string";
  }
};

template<typename T>
T from_string(const string& s){
  istringstream is{s};
  T t;
  if(!(is>>t))
    throw bad_from_string{};
  return t;
}

int main(){
  regex pat{R"((\d{1,2})/(\d{1,2})/(\d{4}))"}; // e.g. 7/21/2022
  ifstream ifs{"test_regex_token_iterator.txt"};
  ofstream ofs{"test_out_regex_token_iterator.txt"};

  regex_token_iterator<string::iterator> rend; // default constructor is used for indicating the end of the sequence
  
  for(string line; getline(ifs, line);){
    smatch matches;
    
    string replace_pattern; 

    int month{0}, day{0}, year{0};

    regex_token_iterator<string::iterator> riter(line.begin(), line.end(), pat);
      
    // for each matched substring, replace it individually
    while(riter!=rend){
      string matched_substring{(*riter).str()};
      // *riter returns a reference to the sub_match object riter is pointing to.
      // sub_match is not a string. sub_match::str() returns the string of the sub_match.
      
      // put each matched substring into variable "matches"
      regex_search(matched_substring, matches, pat);
      
      // get the day, month, and year values in int
      day = from_string<int>(matches.str(2));
      month = from_string<int>(matches.str(1));
      year = from_string<int>(matches.str(3));
      
      // here make replace_pattern yyyy-mm-dd
      if(month<10 && day<10)
        replace_pattern = to_string(year)+"-0"+to_string(month)+"-0"+to_string(day); // both day and month need the leading '0'
      else if(month<10)
        replace_pattern = to_string(year)+"-0"+to_string(month)+"-"+to_string(day);
      else if(day<10)
        replace_pattern = to_string(year)+"-"+to_string(month)+"-0"+to_string(day);
      else
        replace_pattern = to_string(year)+"-"+to_string(month)+"-"+to_string(day);
      
      line = regex_replace(line, regex(matched_substring), replace_pattern); // regex_replace() returns a string
      // since I want to replace only 1 matched substring *riter, I use the exact substring 
      // in the place of regex pattern
      
      ++riter;      // move to the next matched substring
    }
  
    ofs << line << endl; 
  }
  
  return 0;
}

test_regex_token_iterator.txt:

12/01/2022 - 12/31/2022
12/01/2022 - 12/31/2022
12/01/2022 - 12/31/2022
12/01/2022 - 12/31/2022

10/01/2022 - 10/31/2022
10/01/2022 - 10/31/2022
10/01/2022 - 10/31/2022
10/01/2022 - 10/31/2022
10/01/2022 - 10/31/2022

sample test_out_regex_token_iterator.txt (but the result changes in different runs):

2022-12-01 - 12/31/2022
2022-12-01 - 2022-12-31
2022-12-01 - 12/31/2022
2022-12-01 - 12/31/2022

2022-10-01 - 10/31/2022
2022-10-01 - 2022-10-31
2022-10-01 - 10/31/2022
2022-10-01 - 10/31/2022
2022-10-01 - 10/31/2022

I expected all the matched substrings, including the dates in the 2nd column, to be replaced, but only some of them were replaced properly. The expected result:

2022-12-01 - 2022-12-31
2022-12-01 - 2022-12-31
2022-12-01 - 2022-12-31
2022-12-01 - 2022-12-31

2022-10-01 - 2022-10-31
2022-10-01 - 2022-10-31
2022-10-01 - 2022-10-31
2022-10-01 - 2022-10-31
2022-10-01 - 2022-10-31
taka
  • `-fsanitize=address` is your friend. It will tell you immediately where the problem is. – n. m. could be an AI Nov 16 '22 at 06:28
  • @n.m. Thank you for the useful option. I ran the buggy program with it before applying the answer, but all I could tell was that a heap-use-after-free happened somewhere in the program; the other messages were rather cryptic. So the option tells me that such a problem is happening, but not at which line of the program it was caused, does it? That is, I still have to read the program carefully to find where the error occurred. – taka Nov 17 '22 at 10:41
  • It tells you that the line `++riter` causes the problem. Why could incrementing an iterator cause a problem? Because the iterator is invalid. Why is it invalid? Etc etc – n. m. could be an AI Nov 17 '22 at 11:37
  • Thank you for the follow-up. The problem I had with the output messages was that they gave addresses in the executable binary, not line numbers in the source file. I solved this by adding the -g option when compiling. – taka Nov 23 '22 at 00:27

1 Answer


Enabling AddressSanitizer shows that your code causes undefined behaviour: https://godbolt.org/z/n3rnn9nqY

riter holds iterators into line, but at the end of your while loop you reassign line. That invalidates line's iterators and therefore riter, so when you then try to increment riter you enter the realm of undefined behaviour.

Adding a separate string for your output fixes the problem: https://godbolt.org/z/Grqe1vv5x

for(string line; getline(ifs, line);){
  smatch matches;
  string outputLine = line;
  
  string replace_pattern; 

  int month{0}, day{0}, year{0};

  regex_token_iterator<string::iterator> riter(line.begin(), line.end(), pat);
    
  // for each matched substring, replace it individually
  while(riter!=rend){
    string matched_substring{(*riter).str()};
    // *riter returns a reference to the sub_match object riter is pointing to.
    // sub_match is not a string. sub_match::str() returns the string of the sub_match.
    
    // put each matched substring into variable "matches"
    regex_search(matched_substring, matches, pat);
    
    // get the day, month, and year values in int
    day = from_string<int>(matches.str(2));
    month = from_string<int>(matches.str(1));
    year = from_string<int>(matches.str(3));
    
    // here make replace_pattern yyyy-mm-dd
    if(month<10 && day<10)
      replace_pattern = to_string(year)+"-0"+to_string(month)+"-0"+to_string(day); // both day and month need the leading '0'
    else if(month<10)
      replace_pattern = to_string(year)+"-0"+to_string(month)+"-"+to_string(day);
    else if(day<10)
      replace_pattern = to_string(year)+"-"+to_string(month)+"-0"+to_string(day);
    else
      replace_pattern = to_string(year)+"-"+to_string(month)+"-"+to_string(day);
    
    outputLine = regex_replace(outputLine, regex(matched_substring), replace_pattern); // regex_replace() returns a string
    // since I want to replace only 1 matched substring *riter, I use the exact substring 
    // in the place of regex pattern
    
    ++riter;      // move to the next matched substring
  }

  ofs << outputLine << endl; 
}
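As a possible variation on the same idea, the output can also be assembled piece by piece while iterating, instead of calling regex_replace() once per match. The following is only a sketch under stated assumptions: the helper names pad2 and convert_dates_with_iterator are hypothetical, and it uses std::sregex_iterator (not the regex_token_iterator from the question) so that the capture groups are available directly. The key point is the same as above: the string being iterated is const and never reassigned, so the iterators stay valid.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: left-pads a one-character field with '0'.
std::string pad2(const std::string& s) {
  return s.size() < 2 ? "0" + s : s;
}

// Hypothetical helper (not from the original answer): rewrites every
// m/d/yyyy date in `line` as yyyy-mm-dd. `line` is only read, so the
// iterators held by the regex iterator are never invalidated.
std::string convert_dates_with_iterator(const std::string& line) {
  static const std::regex pat{R"((\d{1,2})/(\d{1,2})/(\d{4}))"};

  std::string out;
  auto last = line.cbegin();  // end of the previously copied region
  for (std::sregex_iterator it{line.cbegin(), line.cend(), pat}, end;
       it != end; ++it) {
    const std::smatch& m = *it;
    out.append(last, m[0].first);  // unmatched text before this date
    out += m.str(3) + '-' + pad2(m.str(1)) + '-' + pad2(m.str(2));
    last = m[0].second;
  }
  out.append(last, line.cend());  // unmatched text after the last date
  return out;
}
```

Because each match is rewritten at its own position, this version does not need to build a second regex from the matched text, and a line that contains the same date twice is handled one occurrence at a time.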
Alan Birtles