
We have a data import / transformation tool, currently written in C++, which processes mostly textual data (e.g. timestamps in several different formats) and applies some normalization / sanitization.

The current solution is very fast, and performance is critical. For example, we optimized to minimize / avoid object allocation in the parsing loop, as it is called for each row we process and can have a serious performance impact.

What we want to achieve is flexibility: adding new input formats or applying additional transformations simply by changing a script / configuration file instead of recompiling the application.

The question is: would we be able to achieve comparable performance using the Python Pandas data processing framework, since most of it also seems to be implemented in low-level optimized C code?

("comparable" here means performance difference is within a margin which may be compensated by adding 2-4 additional threads)

Is Pandas the right tool for this job or are there different suggestions?
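
To make "configurable transformations" concrete, here is a rough sketch of what we imagine on the pandas side: per-column converter functions selected by name from a config file. The column names, sample data, and the TRANSFORMS registry below are all hypothetical.

```python
import io
import pandas as pd

# Hypothetical registry: transformations selected by name from a config
# file and applied per column via read_csv's `converters` parameter.
TRANSFORMS = {
    "strip_upper": lambda s: s.strip().upper(),
    "unix_to_iso": lambda s: pd.to_datetime(int(s), unit="s").isoformat(),
}

# In practice this mapping would be loaded from a config file.
config = {"name": "strip_upper", "ts": "unix_to_iso"}

csv_data = io.StringIO("name,ts\n alice ,1383840000\n bob ,1383840060\n")
df = pd.read_csv(csv_data,
                 converters={col: TRANSFORMS[t] for col, t in config.items()})
print(df)
```

Note that Python-level converters run once per value, so this buys flexibility but not the vectorized speed of pandas' built-in C parsing paths.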

VolkA
  • Pandas is certainly fast. I doubt if it's faster than pure C++, but as I'm sure you know the only way to tell for sure is to profile each solution. The best tip I know of for speeding up Pandas timeseries CSV parsing is this: use a custom date-parsing function that utilizes string slicing as opposed to strptime. – Brian Nov 07 '13 at 16:49
  • There are tools that create C code from Python, but even if you do that I don't think it will be faster than your version, since Python has overhead for being a dynamic language. You can probably achieve flexibility by adding a fallback to a Python implementation where a C++ implementation is not available yet. – Raxvan Nov 07 '13 at 17:01
  • @Exceptyon You should probably learn to *read carefully*. Python is 10-100x slower *in number crunching*. Doing I/O, almost all of the time will be taken by syscalls, which take the same time in *any* language; hence in these cases the difference between Python and C/C++ is **much** smaller, even non-existent in some cases. The OP is asking whether the overhead will be significant in their specific application. My bet is that it *might* be significant, but I bet it will be smaller, probably much smaller, than a 10x slowdown. – Bakuriu Nov 07 '13 at 17:09
  • Anyway, without telling us what exactly you are trying to do and which limitations you have, I don't see how we can answer your question except by giving some general advice. `pandas` seems to be the best candidate for this job; however, I cannot exclude that for your specific application there could be something better. – Bakuriu Nov 07 '13 at 17:11
  • Brian, the fastest way is to not parse times during CSV parsing but afterwards. http://stackoverflow.com/questions/11615504/parse-dates-when-yyyymmdd-and-hh-are-in-separate-columns-using-pandas-in-python/18527067#18527067 – K.-Michael Aye Nov 07 '13 at 18:21
  • @Exceptyon your comment is not helpful or well-informed. The pandas CSV file reading code is written mostly in C and thus would have similar performance characteristics to a file reader written solely in C++. – Wes McKinney Nov 07 '13 at 19:57
  • Thanks for the answers, I will try to be more specific: I get different CSV file formats from external sources in large volume. At the moment we parse these files using a C++ parser which is "configured" using a DSL we wrote in boost::spirit. This allows customizing without recompiling the C++ code (e.g. for something like "CONVERT_ISO(UNIX_TIMESTAMP(EXTRACT(column1,2,10)))"). The DSL builds the C++ parser tree once it reads the config file; after that the processing loop is raw C++. AFAIK Pandas also implements the parsing in C, and I would like to know just how efficient this is. Any experience? – VolkA Nov 08 '13 at 14:47
  • Just to add, the above code is a simple example to extract characters 2-10 from "column1" of the CSV, interpret that as a unix timestamp and output it as an ISO date string. Other parsing involves checking values, e.g. "if(column4=="V") column2 else column3". The problem is that extending and maintaining the DSL with boost::spirit becomes too complex, and we are looking at Python as a powerful existing replacement, with Pandas as a "replacement" for the C++ processing core. An alternative I am looking into is using boost::python to wrap the C++ code. – VolkA Nov 08 '13 at 14:52
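
The string-slicing tip from the comments, applied to the CONVERT_ISO(UNIX_TIMESTAMP(EXTRACT(...))) example, might look roughly like this. The raw field layout is made up, and the slice takes characters 2-11 rather than 2-10, since unix timestamps from 2013 are 10 digits:

```python
from datetime import datetime, timezone

def convert_iso_unix_slice(value: str) -> str:
    # Sketch of CONVERT_ISO(UNIX_TIMESTAMP(EXTRACT(column1, 2, 11))):
    # slice the timestamp digits out directly instead of calling strptime.
    ts = int(value[1:11])  # characters 2-11, 1-based inclusive
    return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()

print(convert_iso_unix_slice("#1383840000;V"))  # -> 2013-11-07T16:00:00+00:00
```

Avoiding strptime this way is the same trick Brian's comment recommends for pandas date parsing: a fixed-position slice plus an integer conversion is far cheaper than a general format parser.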

1 Answer


Not to confuse you, but if you REALLY want speed, then HDF5 is the way to go, and NOT even parsing a CSV (so a little pre-processing can go a long way); see this related comparison of HDF5 and SQL.

In [8]: import pandas as pd; from pandas import DataFrame, Timestamp, read_csv; from numpy.random import randn

In [9]: df = DataFrame(randn(1000000,2),columns=list('AB'))

In [10]: df['date'] = Timestamp('20130101')

In [11]: df.to_csv('test.csv',mode='w')

In [12]: df.to_hdf('test.h5','df',mode='w')

In [13]: %timeit read_csv('test.csv',index_col=0)
1 loops, best of 3: 948 ms per loop

In [14]: %timeit pd.read_hdf('test.h5','df')
10 loops, best of 3: 20.2 ms per loop
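
As one of the comments on the question points out, date parsing can also be deferred until after the CSV read and done in a single vectorized call, which is usually far cheaper than per-row parsing (illustrative data):

```python
import pandas as pd

# Read (or build) the frame with the timestamp column left as raw integers,
# then convert the whole column in one vectorized call instead of per row.
df = pd.DataFrame({"ts": [1383840000, 1383840060, 1383840120]})
df["date"] = pd.to_datetime(df["ts"], unit="s")
print(df["date"].iloc[0])  # 2013-11-07 16:00:00
```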
Jeff
  • I think the OP is describing an ETL job, which HDF5 will not help with. – Wes McKinney Nov 07 '13 at 19:58
  • agreed. it is not clear whether intermediate storage is an option. – Jeff Nov 07 '13 at 20:01
  • Yes, thanks for the tip with HDF5, but this is targeted at parsing input files from different sources I have no control over, thus we need CSV parsers, and we are looking for the flexibility to customize parsing different fields using Python. – VolkA Nov 08 '13 at 14:41