
Say I have a message defined in test.proto as:

syntax = "proto3";

message TestMessage {
    int64 id = 1;
    string title = 2;
    string subtitle = 3;
    string description = 4;
}

And I use protoc to convert it to Python like so:

protoc --python_out=. test.proto

timeit for PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python:

from test_pb2 import TestMessage

%%timeit
tm = TestMessage()
tm.id = 1
tm.title = 'test title'
tm.subtitle = 'test subtitle'
tm.description = 'this is a test description'

6.75 µs ± 152 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

timeit for PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=cpp:

1.6 µs ± 115 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Compare that to just a dict:

%%timeit
tm = dict(
    id=1,
    title='test title',
    subtitle='test subtitle',
    description='this is a test description'
)

308 ns ± 2.47 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
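Most of the gap seems to be in the assignment path: every protobuf field set goes through a generated field descriptor that type-checks the value, while a dict just stores it. A plain `__slots__` class (a hypothetical stand-in for `TestMessage`, not part of my project) shows how cheap raw attribute assignment is by comparison:

```python
import timeit

class PlainMessage:
    # Stand-in for TestMessage: attribute sets go straight to instance
    # storage, with no protobuf field descriptors or type checking.
    __slots__ = ("id", "title", "subtitle", "description")

def build():
    tm = PlainMessage()
    tm.id = 1
    tm.title = 'test title'
    tm.subtitle = 'test subtitle'
    tm.description = 'this is a test description'
    return tm

# Time it the same way as the protobuf versions above.
per_loop = min(timeit.repeat(build, number=100_000, repeat=3)) / 100_000
print(f"{per_loop * 1e9:.0f} ns per loop")
```

The absolute number will vary by machine; the point is only the relative cost of the assignments themselves.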

This is also only for one message; the protobuf cpp implementation takes about 10.6 µs for my full project.

Is there a way to make this faster? Perhaps by compiling the generated module (test_pb2)?

Brendan Martin
  • Protocol buffers are widely-used, and pretty well-optimized already, so I doubt it. Also, you don't really "compile" a Python source file; you could use a different interpreter if you needed to (pypy, etc.). But in any case, do you have reason to believe that serialization specifically is a bottleneck in your application? – bnaecker May 13 '20 at 23:32
  • @bnaecker I was thinking there might be a way to output c++ and call those messages from python by building with setup.py somehow. It's a bottleneck for me because I'm parsing millions of rows of data into proto messages and it's taking 15+ hours – Brendan Martin May 14 '20 at 00:47
  • Do you mean write a C++ executable to do the serialization, and then call that from Python? If so, that would be more expensive than what you have (you need to get the data from Python to C++, which is...serialization, plus process overhead). Have you tried the standard tools for parallelizing CPU-bound work, like [`ProcessPoolExecutor`](https://docs.python.org/3/library/concurrent.futures.html?highlight=processpoolexecutor#concurrent.futures.ProcessPoolExecutor), [`joblib`](https://joblib.readthedocs.io/en/latest/) or similar? – bnaecker May 14 '20 at 01:26
  • @bnaecker I found this example which might be what I'm looking for https://yz.mit.edu/wp/fast-native-c-protocol-buffers-from-python/ – Brendan Martin May 14 '20 at 01:48
  • What protobuf and python versions are you using? – Ilan.K Aug 30 '20 at 20:17
  • @Ilan.K I'm using Python 3.8 and Protobuf version 3.9.2 – Brendan Martin Aug 30 '20 at 20:27
  • Hey, @BrendanMartin. Did you solve this issue? – peppered Feb 07 '22 at 16:50
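The parallelization bnaecker suggests in the comments could be sketched like this. `serialize_chunk` here uses `pickle` as a hypothetical stand-in for building and serializing one `TestMessage` per row; the real worker would do `TestMessage(**row).SerializeToString()` instead (and the chunk size and worker count below are arbitrary placeholders to tune):

```python
import pickle
from concurrent.futures import ProcessPoolExecutor

def serialize_chunk(rows):
    # Stand-in: pickle each row dict. Real code would build a
    # TestMessage from the row and call SerializeToString() here.
    return [pickle.dumps(row) for row in rows]

def parallel_serialize(rows, n_workers=4, chunk_size=10_000):
    # Split the rows into chunks and serialize each chunk in a
    # separate process, keeping results in input order via map().
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    out = []
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        for result in pool.map(serialize_chunk, chunks):
            out.extend(result)
    return out

if __name__ == "__main__":
    rows = [{"id": i, "title": f"title {i}"} for i in range(100)]
    blobs = parallel_serialize(rows, n_workers=2, chunk_size=25)
    print(len(blobs))
```

Since the work is CPU-bound, processes (not threads) are needed to get around the GIL; the chunking keeps the inter-process transfer overhead per task small relative to the serialization work.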

0 Answers