String matching performance: gcc versus CPython

Question

Whilst researching performance trade-offs between Python and C++, I've devised a small example, which mostly focusses on a dumb substring matching.

Here is the relevant C++:

using std::string;
std::vector<string> matches;
std::copy_if(patterns.cbegin(), patterns.cend(), back_inserter(matches),
   [&fileContents] (const string &pattern) { return fileContents.find(pattern) != string::npos; } );

The above is built with -O3.

And here is Python:

def getMatchingPatterns(patterns, text):
    return filter(text.__contains__, patterns)

Both of them take a large-ish set of patterns and input file, and filter down the list of patterns to the ones found in the file using a dumb substring search.

The versions are:

gcc - 4.8.2 (Ubuntu) and 4.9.2 (cygwin)
python - 2.7.6 (Ubuntu) and 2.7.8 (cygwin)

What was surprising to me is the performance. I've run both on a low-spec Ubuntu and Python was faster by about 20%. The same on mid-spec PC with cygwin - Python twice faster. Profiler shows that 99+% of cycles are spent in string matching (string copying and list comprehensions are insignificant).

Obviously, the Python implementation is native C, and I'd expected to be roughly the same as C++, but did not expect it as fast.

Any insight into relevant CPython optimisations in comparison to gcc would be most welcome.

For reference, here are the full examples. The inputs just take a set of 50K HTLMs (all read from disk in each test, no special caching):

Python:

import sys

def getMatchingPatterns(patterns, text):
   return filter(text.__contains__, patterns)

def serialScan(filenames, patterns):
   return zip(filenames, [getMatchingPatterns(patterns, open(filename).read()) for filename in filenames])

if __name__ == "__main__":
   with open(sys.argv[1]) as filenamesListFile:
      filenames = filenamesListFile.read().split()
   with open(sys.argv[2]) as patternsFile:
      patterns = patternsFile.read().split()

   resultTuple = serialScan(filenames, patterns)
   for filename, patterns in resultTuple:
      print ': '.join([filename, ','.join(patterns)])

C++:

#include <iostream>
#include <iterator>
#include <fstream>
#include <string>
#include <vector>
#include <unordered_map>
#include <algorithm>

using namespace std;
using MatchResult = unordered_map<string, vector<string>>;
static const size_t PATTERN_RESERVE_DEFAULT_SIZE = 5000;

MatchResult serialMatch(const vector<string> &filenames, const vector<string> &patterns)
   {
   MatchResult res;
   for (auto &filename : filenames)
      {
      ifstream file(filename);
      const string fileContents((istreambuf_iterator<char>(file)),
                                         istreambuf_iterator<char>());
      vector<string> matches;
      std::copy_if(patterns.cbegin(), patterns.cend(), back_inserter(matches),
                   [&fileContents] (const string &pattern) { return fileContents.find(pattern) != string::npos; } );

      res.insert(make_pair(filename, std::move(matches)));
      }
   return res;
   }

int main(int argc, char **argv)
    {
    vector<string> filenames;
    ifstream filenamesListFile(argv[1]);
    std::copy(istream_iterator<string>(filenamesListFile), istream_iterator<string>(),
             back_inserter(filenames));

    vector<string> patterns;
    patterns.reserve(PATTERN_RESERVE_DEFAULT_SIZE);
    ifstream patternsFile(argv[2]);
    std::copy(istream_iterator<string>(patternsFile), istream_iterator<string>(),
             back_inserter(patterns));

    auto matchResult = serialMatch(filenames, patterns);

    for (const auto &matchItem : matchResult)
      {
      cout << matchItem.first << ": ";
      for (const auto &matchString : matchItem.second)
         cout << matchString << ",";
      cout << endl;
      }
    }

@SylvainLeroux - I've run profiler, 99% of time is spent in the actual matching, `std::string::find` and `string.__contains__`. There are string copies from `patterns` (as I scan multiple files one after another, and can't move), but they are insignificant. — RomanK, Mar 15 '15 at 09:23
This is not a question of "how can Python be so fast" but "how can C++ be so slow" — Antti Haapala -- Слава Україні, Mar 15 '15 at 09:33
Did you exclude file caching? gcc version? Could you provide a MCVE? — edmz, Mar 15 '15 at 09:34
See for example [this](http://stackoverflow.com/questions/9371238/why-is-reading-lines-from-stdin-much-slower-in-c-than-python) and [this](http://stackoverflow.com/questions/9688305/python-faster-than-c-how-does-this-happen), that is you wrongly assume that naive C++ is as fast as C. — Antti Haapala -- Слава Україні, Mar 15 '15 at 09:36
FYI, you'll get a further speed increase (by a factor of 2 on my machine) using a list comprehension rather than filter -- `[p for p in patterns if p in text]` — Dunes, Mar 15 '15 at 09:43
@Dunes don't think so, it is clear that the actual pattern matching dwarfs the time spent on filter vs listcomp. — Antti Haapala -- Слава Україні, Mar 15 '15 at 09:46
@black - I did exclude caching during the test. Edited the question, and added versions and full compilable code. @AnttiHaapala - I do not assume C++ to be as fast as C when we deal with streams. However, this question is specific about substring matching, rather than covering I/O. There, the tradeoff of stream buffering with raw `printf/scanf` does not apply. You are correct - it might be more about C++ slowness than Python being fast, and indeed filter vs. listcomp is insignificant here. — RomanK, Mar 15 '15 at 09:48
Correction: I forgot to remove wrapping filter with a list call in python 2.7. @AnttiHaapala However, I tested both filter and the list comprehension with `timeit` and the list comprehension is 20% on python 2.7, and about 30% faster on python 3.4. — Dunes, Mar 15 '15 at 09:49
@Dunes - what was the size of your input and the size of patterns? For patterns, I've used the first 40K lines of `/usr/share/dict/words` and inputs were just contents of top 10 Alexa sites. With input and patterns' sizes going into 10s of KBs, list comprehension versus filtering becomes insignificant. — RomanK, Mar 15 '15 at 09:50
@RomanK I was using a much smaller test set. Just testing for single letters in the ASCII charset. — Dunes, Mar 15 '15 at 09:53
This seems to be an unfair comparison. You're using a regex in python, but not with C++. This will also produce different results. ie. `re.findall("pat|pattern", "pattern")` only finds `"pat"`, whereas the C++ code would find both patterns. — Dunes, Mar 15 '15 at 10:08
@Dunes - Apologies - I posted incorrect code the first time. The right sources are in place now. — RomanK, Mar 15 '15 at 10:11
@AnttiHaapala - Good question. With C++ it's definitely `basic_string`, which means `char`, which in turn means bytes on both platforms. I'm less certain about Python, but would be surprised if we do not deal with bytes as well. — RomanK, Mar 15 '15 at 10:52
Ah actually it shouldn't matter either, it works equally fine with Unicode strings in Python too... — Antti Haapala -- Слава Україні, Mar 15 '15 at 11:16

Antti Haapala -- Слава Україні · Accepted Answer · 2018-08-13T14:52:52.343

17

The python 3.4 code b'abc' in b'abcabc' (or b'abcabc'.__contains__(b'abc') as in your example) executes the bytes_contains method, which in turn calls the inlined function stringlib_find; which delegates the search to FASTSEARCH.

The FASTSEARCH function then uses a simplified Boyer-Moore search algorithm (Boyer-Moore-Horspool):

fast search/count implementation, based on a mix between boyer- moore and horspool, with a few more bells and whistles on the top. for some more background, see: http://effbot.org/zone/stringlib.htm

There are some modifications too, as noted by the comments:

note: fastsearch may access s[n], which isn't a problem when using Python's ordinary string types, but may cause problems if you're using this code in other contexts. also, the count mode returns -1 if there cannot possible be a match in the target string, and 0 if it has actually checked for matches, but didn't find any. callers beware!

The GNU C++ Standard Library basic_string<T>::find() implementation is as generic (and dumb) as possible; it just tries dumbly matching the pattern at each and every consecutive character position until it finds the match.

TL;DR: The reason why the C++ standard library is so slow compared to Python is because it tries to do a generic algorithm on top of std::basic_string<char>, but fails to do it efficiently for the more interesting cases; whereas in Python the programmer gets the most efficient algorithms on case-by-case basis for free.

edited Aug 13 '18 at 14:52

answered Mar 15 '15 at 11:07

Antti Haapala -- Слава Україні

129,958
22
279
321

2

Could someone time C++ with Boyer-Moore vs Python? Sadly I cannot compile anything right now, but Boost has an [implementation of the search](http://www.boost.org/doc/libs/1_57_0/libs/algorithm/doc/html/algorithm/Searching.html#the_boost_algorithm_library.Searching.BoyerMoore). – Baum mit Augen Mar 15 '15 at 11:18
Thanks, this is exactly the sort of answer I was looking for. Technically, nothing prevents `libstdc++` writers from writing a template specialization for `basic_string::find` which does more efficient matching without `char_traits`, so it seems like it's more of a shortcoming on gcc's part. @BaummitAugen - Thanks for the tip, I'll try running giving the Boost library a go. – RomanK Mar 15 '15 at 11:22
3

I'd like to point out that there's a [proposal](http://open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3905.html) to enhance `std::search` and to make use of faster algorithms, such as Boyer-Moore, in specific cases. – edmz Mar 15 '15 at 14:12
2

@BaummitAugen - I've run the same example with Boost's Boyer-Moore. This confirmed yet again the answer above: the performance with Boost was almost exactly the same as with Python - with CPython being just a couple of percent faster. By comparison, the default libstdc++ implementation is 20-30% slower. – RomanK Mar 15 '15 at 21:06
"Use a special algorithm in C++ to get it almost as fast as Python" – Antti Haapala -- Слава Україні Mar 15 '15 at 21:07
@RomanK You can also try [this](http://www.boost.org/doc/libs/1_57_0/libs/algorithm/doc/html/the_boost_algorithm_library/Searching/BoyerMooreHorspool.html), in the proposal paper linked above it apparently was a little bit faster than what I linked before. – Baum mit Augen Mar 15 '15 at 21:12
2

@BaummitAugen - It turned out to be way more than a little bit faster! 35% speed-up compared to CPython and Boost's Boyer-Moore! @AnttiHaapala - As per your answer, it's not so much about languages, but about algorithms. I still can't see a reason why `std::basic_string::find` could not be specialised to do any of the two algorithms we just tried, but at least there's movement in the right direction as per @black 's comment. Substring matching is a very common operation, so there's definitely a pitfall in C++. Very few people would know about and go for Boost's algorithms. – RomanK Mar 15 '15 at 21:30
@RomanK If you are serious about writing C++, just see Boost as part of the standard library. It fixes quit a lot of shortcomings. – Baum mit Augen Mar 15 '15 at 21:37
@BaummitAugen - On one hand I agree with that. On another, I tend to go for Boost only when looking for functionality missing in `std`. Now, thanks to this thread, I know about `std::string::find` being non-performant, but it's by no means common knowledge. – RomanK Mar 16 '15 at 07:06

String matching performance: gcc versus CPython

1 Answers1

Linked