6

A program in both Python and C++ is given below, which performs the following task: read whitespace-delimited words from stdin and print the unique words, sorted by string length, to stdout along with a count of each unique word. The format for a line of the output is: length, count, word.

For example, with this input file (488 kB of a thesaurus): http://pastebin.com/raw.php?i=NeUBQ22T

The output, with formatting, is this:

1 57 "
1 1 n
1 1 )
1 3 *
1 18 ,
1 7 -
1 1 R
1 13 .
1 2 1
1 1 S
1 5 2
1 1 3
1 2 4
1 2 &
1 91 %
1 1 5
1 1 6
1 1 7
1 1 8
1 2 9
1 16 ;
1 2 =
1 5 A
1 1 C
1 5 e
1 3 E
1 1 G
1 11 I
1 1 L
1 4 N
1 681 a
1 2 y
1 1 P
2 1 67
2 1 y;
2 1 P-
2 85 no
2 9 ne
2 779 of
2 1 n;
...

Here is the program in C++

#include <vector>
#include <string>
#include <iostream>
#include <set>
#include <map>

bool compare_strlen (const std::string &lhs, const std::string &rhs) {
  return (lhs.length() < rhs.length());
}

int main (int argc, char *argv[]) {
  std::string str;
  std::vector<std::string> words;

  /* Extract words from the input file, splitting on whitespace */
  while (std::cin >> str) {
    words.push_back(str);
  }

  /* Extract unique words and count the number of occurrences of each word */
  std::set<std::string> unique_words;
  std::map<std::string,int> word_count; 
  for (std::vector<std::string>::iterator it = words.begin();
       it != words.end(); ++it) {
    unique_words.insert(*it);
    word_count[*it]++;
  }

  words.clear();
  std::copy(unique_words.begin(), unique_words.end(),
            std::back_inserter(words));

  // Sort by word length 
  std::sort(words.begin(), words.end(), compare_strlen);

  // Print words with length and number of occurrences
  for (std::vector<std::string>::iterator it = words.begin();
       it != words.end(); ++it) {
    std::cout << it->length() << " " << word_count[*it]  << " " <<
              *it << std::endl;
  }

  return 0;
}

Here is the program in Python:

import fileinput
from collections import defaultdict

words = set()
count = {}
for line in fileinput.input():
  line_words = line.split()
  for word in line_words:
    if word not in words:
      words.add(word)
      count[word] = 1
    else:
      count[word] += 1 

words = list(words)
words.sort(key=len)

for word in words:
  print len(word), count[word], word

For the C++ program, the compiler used was g++ 4.9.0 with the -O3 flag.

The version of Python used was 2.7.3

Time taken for the C++ program:

time ./main > measure-and-count.txt < ~/Documents/thesaurus/thesaurus.txt

real    0m0.687s
user    0m0.559s
sys     0m0.123s

Time taken for the Python program:

time python main.py > measure-and-count.txt < ~/Documents/thesaurus/thesaurus.txt

real    0m0.369s
user    0m0.308s
sys     0m0.029s

The Python program is much faster than the C++ program, and its lead grows with larger input sizes. What's going on here? Is my use of the C++ STL incorrect?

Edit: As suggested by a comment and an answer, I changed the C++ program to use std::unordered_set and std::unordered_map.

The following lines were changed:

#include <unordered_set>
#include <unordered_map>

...

std::unordered_set<std::string> unique_words;
std::unordered_map<std::string,int> word_count;

The compilation command:

g++-4.9 -std=c++11 -O3 -o main main.cpp

This improved performance only marginally:

time ./main > measure-and-count.txt < ~/Documents/thesaurus/thesaurus.txt

real    0m0.604s
user    0m0.479s
sys     0m0.122s

Edit2: A much faster program in C++

This is a combination of NetVipeC's solution, Dieter Lücking's solution, and the top answer to this question. The real performance killer was std::cin staying synchronized with C's stdio by default, which effectively makes every read unbuffered. Solved with std::cin.sync_with_stdio(false);. This solution also uses a single container, taking advantage of the ordered map in C++.

#include <vector>
#include <string>
#include <iostream>
#include <set>
#include <map>

struct comparer_strlen {
    bool operator()(const std::string& lhs, const std::string& rhs) const {
        if (lhs.length() == rhs.length())
            return lhs < rhs;
        return lhs.length() < rhs.length();
    }
};

int main(int argc, char* argv[]) {
    std::cin.sync_with_stdio(false);

    std::string str;

    typedef std::map<std::string, int, comparer_strlen> word_count_t;

    /* Extract words from the input file, splitting on whitespace */
    /* Extract unique words and count the number of occurrences of each word */
    word_count_t word_count;
    while (std::cin >> str) {
        word_count[str]++;
    }

    // Print words with length and number of occurrences
    for (word_count_t::iterator it = word_count.begin();
         it != word_count.end(); ++it) {
        std::cout << it->first.length() << " " << it->second << " "
                  << it->first << '\n';
    }

    return 0;
}

Runtime

time ./main3 > measure-and-count.txt < ~/Documents/thesaurus/thesaurus.txt

real    0m0.106s
user    0m0.091s
sys     0m0.012s

Edit3: A nice and concise version of the Python program was provided by Daniel; it runs in about the same time as the original Python version:

import fileinput
from collections import Counter

count = Counter(w for line in fileinput.input() for w in line.split())

for word in sorted(count, key=len):
  print len(word), count[word], word

Runtime:

time python main2.py > measure-and-count.txt.py < ~/Documents/thesaurus/thesaurus.txt

real    0m0.342s
user    0m0.312s
sys     0m0.027s
OregonTrail
  • `std::map` vs `std::unordered_map`? – clcto Jul 22 '14 at 19:20
  • I'm not confident enough to dupe-hammer this, but here's the possible duplicate: http://stackoverflow.com/questions/9371238/why-is-reading-lines-from-stdin-much-slower-in-c-than-python – Mysticial Jul 22 '14 at 19:27
  • btw. Python is used for its shortness and not speed: `count = Counter(l for line in fileinput.input() for l in line.split()) for word in sorted(count, key=len): print len(word), count[word], word` – Daniel Jul 22 '14 at 19:27
  • Try to lower the optimization settings. Having `-O3` can in certain cases *lower* performance. Experiment with different optimization settings. Also try marking the comparison function `inline`. – Some programmer dude Jul 22 '14 at 19:35
  • -O2 is slightly slower on my machine ~(-10ms) – OregonTrail Jul 22 '14 at 19:56
  • What happens if you replace std::endl with '\n' maybe the difference is just the repeated flush from std::endl? – Drek Jul 22 '14 at 20:13
  • Try `unordered` now that you have the big bottleneck killed. It may save another huge chunk... – Yakk - Adam Nevraumont Jul 22 '14 at 21:03
  • The third solution in my post, as opposed to the second, actually uses the ordered `map` to sort during insertion. – OregonTrail Jul 22 '14 at 21:04
  • @OregonTrail yes, but the speedup from `unordered_` during building of the count *may* make up for making a `std::vector` and sorting it afterwards. It might not, but it might. – Yakk - Adam Nevraumont Jul 22 '14 at 21:25

6 Answers

4

Test with this; it should be faster than the original C++.

Changes are:

  • Eliminated the vector `words` used to save the words (they are already saved in `word_count`).
  • Eliminated the set `unique_words` (`word_count` holds only the unique words).
  • Eliminated the second copy to `words`; it is not needed.
  • Eliminated the sort of the words (the ordering is now maintained by the map: the words in the map are ordered by length, and words of the same length are ordered lexicographically).

    #include <vector>
    #include <string>
    #include <iostream>
    #include <set>
    #include <map>
    
    struct comparer_strlen_functor {
        bool operator()(const std::string& lhs, const std::string& rhs) const {
            if (lhs.length() == rhs.length())
                return lhs < rhs;
            return lhs.length() < rhs.length();
        }
    };
    
    int main(int argc, char* argv[]) {
        std::cin.sync_with_stdio(false);
    
        std::string str;
    
        typedef std::map<std::string, int, comparer_strlen_functor> word_count_t;
    
        /* Extract words from the input file, splitting on whitespace */
    /* Extract unique words and count the number of occurrences of each word */
        word_count_t word_count;
        while (std::cin >> str) {
            word_count[str]++;
        }
    
    // Print words with length and number of occurrences
        for (word_count_t::iterator it = word_count.begin(); it != word_count.end();
             ++it) {
            std::cout << it->first.length() << " " << it->second << " " << it->first
                      << "\n";
        }
    
        return 0;
    }
    

New version of the reading loop, to read line by line and split. It needs #include <boost/algorithm/string.hpp> and a typedef for the split iterator:

typedef boost::split_iterator<std::string::iterator> string_split_iterator;

while (std::getline(std::cin, str)) {
    for (string_split_iterator It = boost::make_split_iterator(
             str, boost::first_finder(" ", boost::is_iequal()));
         It != string_split_iterator(); ++It) {
        if (It->end() - It->begin() != 0)
            word_count[boost::copy_range<std::string>(*It)]++;
    }
}

Testing on a Core i5 with 8 GB RAM, GCC 4.9.0, 32-bit, it runs in 238 ms. Updated the code with std::cin.sync_with_stdio(false); and '\n' as proposed.

NetVipeC
  • Your version is only marginally faster, at `0.540s` on my machine. – OregonTrail Jul 22 '14 at 19:40
  • `g++-4.9 -std=c++11 -O3 -o main2 main2.cpp` – OregonTrail Jul 22 '14 at 19:42
  • Are you executing both programs multiple times? – NetVipeC Jul 22 '14 at 19:49
  • Yes, the proper way to measure performance would be an average of many runs. In this case I see performance within +-5ms running the test 10 or so times before pasting a representative result. – OregonTrail Jul 22 '14 at 19:55
  • I do like your use of the STL better than mine, as you take advantage of the ordered `map` which I was not aware of. – OregonTrail Jul 22 '14 at 20:01
  • As noted below, `std::endl` is not "end of line", but rather "end of line and flush". Use `"\n"` if you do not need a flush. – Yakk - Adam Nevraumont Jul 22 '14 at 20:14
  • I like this answer because it takes **advantage** of the ordering property of the default `map` in STL. However, most of the speedup in the final answer was from using `std::cin.sync_with_stdio(false);` and `'\n'` instead of `std::endl`. – OregonTrail Jul 22 '14 at 21:23
2

Making three changes: omitting the extra vector (which you do not have in Python), reserving memory for the word vector, and avoiding the endl (!) in the output:

#include <algorithm>
#include <vector>
#include <string>
#include <iostream>
#include <set>
#include <map>

bool compare_strlen (const std::string &lhs, const std::string &rhs) {
  return (lhs.length() < rhs.length());
}

int main (int argc, char *argv[]) {
    /* Extract words from the input file, splitting on whitespace */
    /* Also count the number of occurrences of each word */
    std::map<std::string, int> word_count;
    {
        std::string str;
        while (std::cin >> str) {
            ++word_count[str];
        }
    }

    std::vector<std::string> words;
    words.reserve(word_count.size());
    for(std::map<std::string, int>::const_iterator w = word_count.begin();
        w != word_count.end();
        ++w)
    {
        words.push_back(w->first);
    }

    // Sort by word length
    std::sort(words.begin(), words.end(), compare_strlen);

    // Print words with length and number of occurrences
    for (std::vector<std::string>::iterator it = words.begin();
       it != words.end();
       ++it)
    {
        std::cout << it->length() << " " << word_count[*it]  << " " <<
                  *it << '\n';
    }
    return 0;
}

Gives:

Original:

real    0m0.230s
user    0m0.224s
sys 0m0.004s

Improved:

real    0m0.140s
user    0m0.132s
sys 0m0.004s

More improved by adding std::cin.sync_with_stdio(false); (see OregonTrail's question):

real    0m0.107s
user    0m0.100s
sys 0m0.004s

And NetVipeC's solution with std::cin.sync_with_stdio(false); and the replacement of std::endl by '\n':

real    0m0.077s
user    0m0.072s
sys 0m0.004s

Python:

real    0m0.146s
user    0m0.136s
sys 0m0.008s

Dieter Lücking

1
  std::vector<std::string> words;

  /* Extract words from the input file, splitting on whitespace */
  while (std::cin >> str) {
    words.push_back(str);
  }

This requires constantly repeating allocate/copy/free operations as the vector grows. Either pre-allocate the vector or use something like a list.

David Schwartz
  • The Python program isn't doing any pre-allocation. Also, pre-allocating would require counting the words in the file. – OregonTrail Jul 22 '14 at 19:42
  • @OregonTrail If you understand what I'm saying, your response is a non sequitur. (And it doesn't require counting them. Just pre-allocating, say, a million entries, would be fine. Also, how do you know whether the Python program pre-allocates or not?) – David Schwartz Jul 22 '14 at 19:44
  • Did you try it out though? I don't see a significant difference. I guess it is dominated by IO. – juanchopanza Jul 22 '14 at 19:51
  • The Python program dynamically grows the `set()`, Python must `realloc()` this data structure (at least the hash table, unless it has an unusually large default size for `set()`). Also, to **never** have the vector copy itself in the `C++` program, it would have to count the words in the input file in advance, and then rewind the file. – OregonTrail Jul 22 '14 at 19:51
  • @DavidSchwartz I think you're close. I believe Python must be using some form of buffered read from the input file. cin will not buffer the file in memory, so each read takes IO. Instead, one should read in the whole file (or in batches) and perform the parsing in memory. – Climax Jul 23 '14 at 13:03
0

There are a few problems with your C++ code.

First, you are using mutable strings, which means you are copying them around a bunch (Python strings are immutable). Testing for this, I find the effect can actually make the C++ code slower, so let's drop this one.

Second, unordered_ containers are probably a good idea. Testing this, I get a 1/3 speedup by swapping them in/out (using boost::hash algorithm for hashing).

Third, your use of std::endl flushes std::cout on each line. That seems silly.

Fourth, use std::cin.sync_with_stdio(false); to reduce the overhead of std::cin, or don't use the streams at all.

Fifth, directly construct the set and map from io, don't round-trip through a std::vector needlessly.

Here is a test program (with hard coded data of about 1/4th the size) with immutable strings (std::shared_ptr<const std::string>) and unordered_ containers with manual hash setups, and a few C++11 features to make the code a bit shorter.

Strip out the large R"( string literal, and replace the stringstream with std::cin.

For more performance, don't use the heavy-weight streaming objects. They do a lot of really, really paranoid work.

Yakk - Adam Nevraumont
  • To your last point, using `std::cin.sync_with_stdio(false);` gave me a 4x speedup. The latest version is at the bottom of the original post. – OregonTrail Jul 22 '14 at 20:49
  • @OregonTrail Updated. About 5x faster than it was (0.05 down to 0.01). Includes code to easily swap between smart-pointer based strings and by-value strings, and unordered vs ordered containers. – Yakk - Adam Nevraumont Jul 22 '14 at 21:03
0

Here's yet another C++ version that I believe more closely matches the Python version line by line. It attempts to retain the same types of containers and operations found in the Python version, with obvious C++-specific tweaks. Note that I lifted the sync_with_stdio optimization from the other answers.

#include <iostream>
#include <unordered_set>
#include <unordered_map>
#include <list>

#include <sstream>
#include <iterator>


bool compare_strlen(const std::string &lhs, const std::string &rhs)
{
    return lhs.length() < rhs.length();
}

int main (int argc, char *argv[]) {
    std::unordered_set<std::string> words;
    std::unordered_map<std::string, std::size_t> count;

    // Make std::cin use its own buffer to improve I/O performance.
    std::cin.sync_with_stdio(false);

    // Extract words from the input file line-by-line, splitting on
    // whitespace
    char line[128] = {};  // Yes, std::vector or std::array would work, too.
    while (std::cin.getline(line, sizeof(line) / sizeof(line[0]))) {
        // Tokenize
        std::istringstream line_stream(line);
        std::istream_iterator<std::string> const end;
        for(std::istream_iterator<std::string> i(line_stream);
            i != end;
            ++i) {
            words.insert(*i);
            count[*i]++;
        }
    }

    std::list<std::string> words_list(words.begin(), words.end());
    words_list.sort(compare_strlen);

    // Print words with length and number of occurrences
    for (auto const & word : words_list)
        std::cout << word.length()
                  << ' ' << count[word]
                  << ' ' << word
                  << '\n';

    return 0;
}

The results are comparable to your original Python code and @NetVipeC's C++.

C++

real    0m0.979s
user    0m0.080s
sys     0m0.016s

Python

real    0m0.993s
user    0m0.112s
sys     0m0.060s

I was a bit surprised this version of the C++ performed comparably to the other streamlined C++ answers to your question, since I thought for sure things like the stringstream based tokenizing would be a bottleneck.

Void - Othman
-1

Both std::set and std::map are optimized for lookups, not insertion. They keep their contents in sorted order, re-balancing the tree on every insertion. You might try std::unordered_set and std::unordered_map, which are hash-based and would be faster for your use case.

Khouri Giordano