12

Working with Python on relatively small projects makes me appreciate its dynamically typed nature (no declaration code needed to keep track of types), which often makes for a quicker and less painful development process. However, I feel that on much larger projects this may actually be a hindrance, as the code would run slower than, say, its equivalent in C++. But then again, using Numpy and/or Scipy with Python may get your code to run just as fast as a native C++ program (whereas the equivalent C++ code would sometimes take longer to develop).

I post this question after reading Justin Peel's comment on the thread "Is Python faster and lighter than C++?", where he states: "Also, people who speak of Python being slow for serious number crunching haven't used the Numpy and Scipy modules. Python is really taking off in scientific computing these days. Of course, the speed comes from using modules written in C or libraries written in Fortran, but that's the beauty of a scripting language in my opinion." Or as S. Lott writes on the same thread regarding Python: "...Since it manages memory for me, I don't have to do any memory management, saving hours of chasing down core leaks." I also looked at a related Python/Numpy/C++ performance question, "Benchmarking (python vs. c++ using BLAS) and (numpy)", where J.F. Sebastian writes "...There is no difference between C++ and numpy on my machine."

Both of these threads got me wondering whether there is any real advantage conferred by knowing C++ for a Python programmer who uses Numpy/Scipy to produce software for analyzing 'big data', where performance is obviously of great importance (but code readability and development speed are also a must)?

Note: I'm especially interested in handling huge text files. Text files on the order of 100K-800K lines with multiple columns, where Python could take a good five minutes to analyze a file "only" 200K lines long.

warship
  • If you're really concerned with speed for those text files, it'd be worth benchmarking where the time is spent - probably mostly in disk access as @HenryKeiter suggests, but if the text processing is adding significantly, you may find gains by cleverly using python builtins (which will be much faster than python loops etc..) and/or processing the text with Cython (with appropriate c_types - a bit more of a learning curve there, but probably easier than C++). – drevicko Nov 03 '14 at 23:19

3 Answers

11

First off, if the bulk of your "work" comes from processing huge text files, that often means your only meaningful speed bottleneck is disk I/O, regardless of programming language.
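A rough way to check where the time actually goes (a minimal sketch of the benchmarking drevicko suggests in the comments, not part of the original answer, assuming a tab-delimited file at the hypothetical path data.txt) is to time a raw read of the file separately from a simple parsing pass; the second pass may look cheaper than it should because the OS has already cached the file:

import time

PATH = "data.txt"  # hypothetical multi-column, tab-delimited file

# Pass 1: raw disk read only, in 1 MiB chunks.
t0 = time.perf_counter()
with open(PATH, "rb") as f:
    while f.read(1 << 20):
        pass
read_time = time.perf_counter() - t0

# Pass 2: read plus a simple split-based parse.
t0 = time.perf_counter()
with open(PATH) as f:
    rows = [line.split("\t") for line in f]
parse_time = time.perf_counter() - t0

print("raw read: %.2fs, read + parse: %.2fs" % (read_time, parse_time))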

As to the core question, it's probably too opinion-rich to "answer", but I can at least give you my own experience. I've been writing Python to do big data processing (weather and environmental data) for years. I have never once encountered significant performance problems due to the language.

Something that developers (myself included) tend to forget is that once the process runs fast enough, it's a waste of company resources to spend time making it run any faster. Python (using mature tools like pandas/scipy) runs fast enough to meet the requirements, and it's fast to develop, so for my money, it's a perfectly acceptable language for "big data" processing.
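For the kind of multi-column text files described in the question, a minimal pandas sketch (my illustration, assuming a tab-delimited file at the hypothetical path data.txt with no header row) would be something like this, with both the parsing and the column-wise statistics running in compiled code:

import pandas as pd

# Hypothetical tab-delimited file, a few hundred thousand rows, multiple columns.
df = pd.read_csv("data.txt", sep="\t", header=None)

# Column-wise summary statistics, computed in vectorized/compiled code.
print(df.describe())

# Example aggregation: the ten most frequent values in the first column.
print(df.groupby(0).size().nlargest(10))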

Henry Keiter
  • I know that weather and environmental data is on the scale of terabytes, frequently making frameworks like Hadoop very useful (where the innate language is Java (but also has Python and C++ streaming)). From your multiyear experience working with such big data using Python, do you ever find that there are times when implementing your solutions in C++ would be more conducive for your big data purposes (albeit less productive in terms of development speed and costs)? – warship Jul 31 '14 at 04:56
  • @XYZ927 I've never found Python to be a meaningful bottleneck. There are packages optimized for the purpose, as you've noted, and I've personally never encountered a case where these are insufficient. Especially considering how complex these processes tend to be, I think the readability and clarity of Python is a huge benefit. *Could* they be made faster in pure C/C++/FORTRAN? Probably, but personally I haven't found it to be worth the development effort. These things tend to be run overnight anyway-- as long as it's done by morning, who cares whether it finished at 4:30 or 5:00? – Henry Keiter Jul 31 '14 at 05:17
  • Thank you for your feedback. I also would like to reference one more post I found concerning this matter which shows that C++ code, if not written in a certain way, can actually run slower than Python: http://stackoverflow.com/questions/9371238/why-is-reading-lines-from-stdin-much-slower-in-c-than-python?rq=1 – warship Aug 01 '14 at 20:40
  • Lol yeah, that is the reason why big hedge funds use C++ purely. They simply don't know that python is "good" for big data –  Dec 30 '17 at 09:19
5

The short answer is that for simple problems, there should not be much difference. If you want to do anything complicated, you will quickly run into stark performance differences.

As a simple example, try adding three vectors together

a = b + c + d

In Python, as I understand it, this generally adds b to c, adds the result to d, and then makes the name a point to that final result. Each of those operations can be fast, since they are farmed out to fast compiled code. However, if the vectors are large, the intermediate result cannot be stored in cache, and moving that intermediate result to and from main memory is slow.
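As a rough numpy illustration of that point (a sketch, not part of the original answer): the plain expression materializes a full-size temporary array for b + c, while reusing an output buffer via np.add's out= argument avoids that extra array.

import numpy as np

n = 10_000_000
b = np.random.rand(n)
c = np.random.rand(n)
d = np.random.rand(n)

# Plain expression: numpy builds a full-size temporary for (b + c) first.
a = b + c + d

# Reusing an output buffer avoids the extra full-size temporary.
a2 = np.empty_like(b)
np.add(b, c, out=a2)
np.add(a2, d, out=a2)

assert np.allclose(a, a2)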

You can do the same thing in C++ using valarray and it will be equivalently slow. However, you can also do something else

for(int i=0; i<N; ++i)          // single pass over the data, no temporary vector
  a[i] = b[i] + c[i] + d[i];

This gets rid of the intermediate result and makes the code less sensitive to main-memory bandwidth.

Doing the equivalent thing in Python is possible, but Python's looping constructs are not as efficient. They do nice things like bounds checks, but sometimes it is faster to run with the safeties disengaged. Java, for example, does a fair amount of work to remove bounds checks, so if you had a sufficiently smart compiler/JIT, Python's loops could be fast. In practice, that has not worked out.
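For comparison, a pure-Python sketch of the same loop (my illustration, with made-up list sizes) avoids the temporaries but pays interpreter overhead, bounds checks and dynamic dispatch on every element, which is why it tends to be far slower than either the numpy expression or the C++ loop:

# Hypothetical lists standing in for the vectors b, c and d.
b = [1.0] * 1_000_000
c = [2.0] * 1_000_000
d = [3.0] * 1_000_000

# No temporary list is built for (b + c), but every iteration runs in the interpreter.
a = [0.0] * len(b)
for i in range(len(b)):
    a[i] = b[i] + c[i] + d[i]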

Damascus Steel
  • I should have specified in my question that I am not so much interested in multidimensional matrices as I am in huge text files. Text files on the order of 100K-800K lines with multiple columns, where Python could take a good five minutes to analyze a file "only" 200K lines long. – warship Jul 31 '14 at 02:16
  • use a = numexpr.evaluate('b + c + d') – Begelfor Jan 08 '15 at 12:40
  • @warship: The argument of creating custom optimized loops instead of gluing together optimized building blocks applies more generally. *If* you take the time to manually vectorize with SIMD (or write C++ that can auto-vectorize), you can get blazingly fast performance, especially within L1D or L2 cache. If the standard building blocks don't get the job done in one or two steps, manually looping can be a big win in C++. – Peter Cordes Sep 26 '17 at 21:58
1

Python will definitely save you development time, and it gives you flexibility as well, if you are just comparing the two languages. It still can't match the raw power and performance of C/C++, but who cares, in this age of abundant memory, clusters, caching and parallel processing techniques? Another disadvantage of C++ can be crashes, and debugging and fixing them with big data can be a nightmare.

Having said that, I have not seen a one-size-fits-all solution; no programming language contains solutions to every problem (unless you are an old native C developer who likes to build the database in C as well :). You first have to identify all the problems and requirements: the type of data, whether it is structured or unstructured, what sort of text files you need to manipulate, in what way and order, whether scheduling is an issue, and so on. Then you need to build a complete application stack with some tool sets and scripting languages. You can always put more money into hardware, or even buy an expensive tool like Ab Initio, which gives you the power to load, parse and manipulate those large text files. But unless you need really high-end pattern matching on really big data files, Python would be just fine in conjunction with other tools. I don't see a single yes/no answer; in certain situations, Python may not be the best solution.

Naumann