2

I am running a c++ program in VS. I provided a regex and I am parsing a file which is over 2 million lines long for strings that match that regex. Here is the code:

int main() {
    ifstream myfile("file.log");
    if (myfile.is_open())
    {
        int order_count = 0;
        regex pat(R"(.*(SOME)(\s)*(TEXT).*)");
        for (string line; getline(myfile, line);)
        {
            smatch matches;
            if (regex_search(line, matches, pat)) {
                order_count++;
            }
        }
        myfile.close();
        cout << order_count;
    }

    return 0;
}

The file should search for the matched strings and count their occurrences. I have a python version of the program that does this within 4 seconds using the same regex. I have been waiting around 5 minutes for the above c++ code to work and it still hasn't finished. It is not running into an infinite loop because I had it print out its current line number at certain intervals and it is progressing.Is there a different way I should write the above code?

EDIT: This is run in release mode.

EDIT: Here is the python code:

class PythonLogParser:

def __init__(self, filename):
    self.filename = filename

def open_file(self):
    f = open(self.filename)
    return f

def count_stuff(self):
    f = self.open_file()
    order_pattern = re.compile(r'(.*(SOME)(\s)*(TEXT).*)')
    order_count = 0
    for line in f:
        if order_pattern.match(line) != None:
            order_count+=1 # = order_count + 1
    print 'Number of Orders (\'ORDER\'): {0}\n'.format(order_count)
    f.close()

The program finally stopped running. What's most disconcerting is that the output is incorrect (I know what the correct value should be).

Perhaps using regex for this problem is not the best solution. I will update if I find a solution that works better.

EDIT: Based on the answer by @ecatmur, I made the following changes, and the c++ program ran much faster.

int main() {
        ifstream myfile("file.log");
        if (myfile.is_open())
        {
            int order_count = 0;
            regex pat(R"(.*(SOME)(\s)*(TEXT).*)");
            for (string line; getline(myfile, line);)
            {
                if (regex_match(line, pat)) {
                    order_count++;
                }
            }
            myfile.close();
            cout << order_count;
        }

        return 0;
    }
Jeremy Fisher
  • 2,510
  • 7
  • 30
  • 59
  • 3
    Do you run it in Debug mode? – TobiasR. Jul 16 '15 at 13:56
  • 2
    Add your python version as well. please. Do you compare on the same machine? Is **antivirus software** turned off? –  Jul 16 '15 at 13:58
  • 1
    make sure you read answers here: http://stackoverflow.com/questions/14205096/c11-regex-slower-than-python – Alexander Balabin Jul 16 '15 at 14:00
  • 1
    Provide us an actual, working implementation in C++ and Python that we can run for ourselves. The data you give us need not be 2 million lines long, maybe 20 lines long (maybe 2 lines long, I don't care) – AndyG Jul 16 '15 at 14:05
  • 2
    I guess your python version is performing **one** read (in `for line in f` it will read the whole file) while your C++ program is doing ~ **2M** reads (one for every line). Reading/writing from/to the disk is generally slow, so I guess it's better to read the whole file content to once and then parse the corresponding string line by line. – a_guest Jul 16 '15 at 14:05
  • @AndyG sorry I must have accidentally deleted it when I was formatting my question. It is actually order_count. – Jeremy Fisher Jul 16 '15 at 14:06
  • 1
    Which `regex` do you use? If that is `std::regex` do you have boost installed and can you try `boost::regex`? – Slava Jul 16 '15 at 14:09
  • @Slava it is standard regex, since the users that will be using this application also don't have boost installed I would like to avoid using boost if possible – Jeremy Fisher Jul 16 '15 at 14:13
  • 1
    Are the regex patterns same? The one in c++ code seems to have different capturing groups vs. ones the python code – ramana_k Jul 16 '15 at 14:14
  • 1
    @Ramana yes they are the same, that was just a question formatting mistake I made. – Jeremy Fisher Jul 16 '15 at 14:15
  • Should we assume that in the full version of the C++ code you actually do something useful with the `matches` object rather than having `regex_search` populate it and then immediately throwing it away? – David K Jul 16 '15 at 14:19
  • 2
    You can test the theory that the C++ program is slowed down by reading one-line-at-a-time: replace the `regex_search` with a line that just increments `order_count`, and see if the program still takes a long time to run. – David K Jul 16 '15 at 14:21
  • My python is a little hazy, but I think there is a difference in the search/match that is being performed. C++ `regex_search` is not the same as `regex_match`, the search will match all possible matches. The python `match` matches from the beginning of the line. For the C++, look at `regex_constants::match_continuous` – Niall Jul 16 '15 at 14:30
  • You need to find cause of the issue, if you try `boost::regex` you may find out that problem with `std::regex`, you can use other libraries then, test it with `boost` just easiest as requires minimal code change. – Slava Jul 16 '15 at 14:33
  • As strace shows on Linux, Python actually reads either 8192 or 4096 bytes each time. And still it is very fast. So, no reading whole file at once. –  Jul 16 '15 at 14:43

1 Answers1

8

You should be using regex_match, not regex_search.

7.2.5.3. search() vs. match()

Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string

And:

std::regex_search

regex_search will successfully match any subsequence of the given sequence, whereas std::regex_match will only return true if the regular expression matches the entire sequence.

By using regex_search you are generating n * m match results, where n is the number of characters before and m is the number of characters after the central part of your search string. Not surprisingly, this takes a long time to generate.

In fact, the more efficient would be to use regex_search, but only with the central part of your search string:

    regex pat(R"((SOME)(\s)*(TEXT))");

And use the overload of regex_search that doesn't take a match results out-parameter (as you're ignoring the match results):

        if (regex_search(line, pat)) {    // no "matches"
            order_count++;
        }
ecatmur
  • 152,476
  • 27
  • 293
  • 366
  • The C++ `regex_constants::match_continuous` may also help mimic the python match - my python here is a little hazy. – Niall Jul 16 '15 at 14:36
  • Worked much more quickly than the regex_search after making the changes, however, the output is still incorrect even though I'm using the same regex in both c++ and python. is there a difference between the way c++ interprets regex and the way python interprets regex? – Jeremy Fisher Jul 16 '15 at 14:41
  • 3
    @JeremyFisher oh, plenty; the default C++ regex syntax is Modified ECMAScript, which is pretty close to the Javascript syntax: http://en.cppreference.com/w/cpp/regex/ecmascript ; meanwhile Python has its own syntax: https://docs.python.org/2/library/re.html#regular-expression-syntax - you'll have to look into exactly where they disagree on your pattern. – ecatmur Jul 16 '15 at 14:49
  • @JeremyFisher: change `\s` to `\\s`. To improve your patterns in general, avoid to create useless capture groups. `(\s)*` => `\s*` – Casimir et Hippolyte Jul 16 '15 at 15:13
  • @CasimiretHippolyte he's using a raw literal, so I don't think double-escaping `\s` is necessary. – ecatmur Jul 16 '15 at 15:18