This is a follow-up question to Haskell Thrift library 300x slower than C++ in performance test.
So recently I decided to play around with Thrift and noticed that it was almost prohibitively slow, and I stumbled upon the post mentioned above, so I decided to take a deep dive :) The original question had some suggestions for improving the poster's code, but there wasn't much in the way of suggestions for improving the library code. I'm not incredibly experienced with Haskell, but I thought I would give it a try.
I got the source code for Thrift (https://github.com/apache/thrift/tree/master/lib/hs) and rebuilt it with profiling enabled so I could get an idea of how the library was performing... in a single word: poorly, and that may be too generous. I decided to focus on the ping test, since that appeared to be the biggest offender.
First, here's how I built everything to get the times (and with profiling added to get the numbers below):
- GHC Version: 7.8.4
- Thrift Version: 0.9.2 (cloned from GitHub repo)
Building Thrift:
runghc Setup.lhs configure --user -O2 && runghc Setup.lhs build && runghc Setup.lhs install
Building Main:
ghc Main -O2
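For the profiled numbers below I rebuilt both with profiling enabled; the commands were along these lines (I'm reproducing the flags from memory, so treat them as approximate):
Building Thrift with profiling:
runghc Setup.lhs configure --user -O2 --enable-library-profiling && runghc Setup.lhs build && runghc Setup.lhs install
Building Main with profiling:
ghc Main -O2 -prof -fprof-auto -rtsopts
Running:
./Main +RTS -p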
Times:
- Haskell: 3960ms
- C++: 5ms
... ouch :(
Here's the profiler output, with a few cost centres added to break down the expensive methods:
COST CENTRE MODULE %time %alloc
SCC_readMessageBegin ItemStore 24.0 29.8
SCC_writeMessageBegin ItemStore 24.0 18.9
MAIN MAIN 20.0 2.7
SCC_THRIFT_parseBinaryValue_ver Thrift.Protocol.Binary 8.0 4.3
CAF GHC.Event.Thread 4.0 0.0
SCC_THRIFT_parseBinaryValue_s Thrift.Protocol.Binary 4.0 9.3
SCC_THRIFT_parseBinaryValue_decodeUtf8 Thrift.Protocol.Binary 4.0 1.5
proc_ ItemStore 4.0 1.9
read_Ping_args ItemStore 4.0 3.3
main Main 4.0 1.3
unstreamChunks/inner Data.Text.Internal.Lazy.Fusion 0.0 1.6
SCC_THRIFT_parseBinaryValue_sz Thrift.Protocol.Binary 0.0 5.0
process_ping ItemStore 0.0 2.1
write_Ping_result ItemStore 0.0 14.2
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 124 0 20.0 2.7 100.0 100.0
main Main 250 0 0.0 0.0 72.0 95.0
process ItemStore 251 1 0.0 0.8 72.0 94.9
SCC_readMessageEnd ItemStore 282 100 0.0 0.1 0.0 0.1
SCC_proc_ ItemStore 261 100 0.0 0.1 32.0 44.1
proc_ ItemStore 262 100 4.0 1.9 32.0 44.0
process_ping ItemStore 268 100 0.0 2.1 28.0 40.8
SCC_tFlush ItemStore 281 100 0.0 0.7 0.0 0.7
SCC_writeMessageEnd ItemStore 280 100 0.0 0.1 0.0 0.1
SCC_write_Ping_result ItemStore 277 100 0.0 0.0 0.0 14.2
write_Ping_result ItemStore 278 100 0.0 14.2 0.0 14.2
from_Ping_result ItemStore 279 100 0.0 0.0 0.0 0.0
SCC_writeMessageBegin ItemStore 273 100 24.0 18.9 24.0 20.2
unstreamChunks/outer Data.Text.Internal.Lazy.Fusion 274 100 0.0 0.1 0.0 1.3
unstreamChunks/inner Data.Text.Internal.Lazy.Fusion 275 400 0.0 0.8 0.0 1.2
unstreamChunks/resize Data.Text.Internal.Lazy.Fusion 276 100 0.0 0.4 0.0 0.4
SCC_Iface.ping ItemStore 271 100 0.0 0.0 0.0 0.1
ping Main 272 100 0.0 0.1 0.0 0.1
SCC_read_Ping_args ItemStore 269 100 0.0 0.1 4.0 3.3
read_Ping_args ItemStore 270 100 4.0 3.3 4.0 3.3
unstreamChunks/outer Data.Text.Internal.Lazy.Fusion 263 100 0.0 0.1 0.0 1.3
unstreamChunks/inner Data.Text.Internal.Lazy.Fusion 264 400 0.0 0.8 0.0 1.2
unstreamChunks/resize Data.Text.Internal.Lazy.Fusion 265 100 0.0 0.4 0.0 0.4
SCC_readMessageBegin ItemStore 252 1 24.0 29.8 40.0 49.9
SCC_THRIFT_parseBinaryValue_ver Thrift.Protocol.Binary 254 0 8.0 4.3 16.0 20.1
SCC_THRIFT_parseBinaryValue_s Thrift.Protocol.Binary 255 100 4.0 9.3 8.0 15.8
SCC_THRIFT_parseBinaryValue_decodeUtf8 Thrift.Protocol.Binary 260 0 4.0 1.5 4.0 1.5
SCC_THRIFT_parseBinaryValue_sz Thrift.Protocol.Binary 256 100 0.0 5.0 0.0 5.0
SCC_THRIFT_parseBinaryValue_decodeUtf8 Thrift.Protocol.Binary 259 100 0.0 0.0 0.0 0.0
SCC_THRIFT_parseBinaryValue_ver Thrift.Protocol.Binary 257 100 0.0 0.0 0.0 0.0
Yikes. You'll see that the largest offenders, readMessageBegin (which contains the SCC_THRIFT_parseBinaryValue_* cost centres) and writeMessageBegin, account for 64% of the total time. All these methods do is read two ints and a string! This is clearly unacceptable.
If you look at Thrift/Protocol.hs (https://github.com/apache/thrift/blob/master/lib/hs/src/Thrift/Protocol.hs) and Thrift/Protocol/Binary.hs (https://github.com/apache/thrift/blob/master/lib/hs/src/Thrift/Protocol/Binary.hs), you'll notice they use Attoparsec to parse the binary message header, which consists of a version :: Int32, a string (the method name), and a sequence id :: Int32.
First off, it seems strange (or is it?) to use Attoparsec to read data whose exact structure you already know. Second, in Protocol.hs, runParser reads from whatever transport is being used one byte at a time(!), using Attoparsec's Partial results to continuously re-apply the parser. Additionally, it appears the code converts from a lazy ByteString to a strict ByteString and back to a lazy ByteString several times per parse. None of this can be efficient.
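To make that last point concrete, here is a minimal sketch of the byte-at-a-time pattern described above. This is my own paraphrase rather than the actual library code, and I'm using a plain Handle where the library would go through its Transport class:

import qualified Data.Attoparsec.ByteString as A
import qualified Data.ByteString as BS
import System.IO (Handle)

-- Feed the parser one byte at a time, resuming the Partial continuation
-- after every read, until the parser either finishes or fails.
parseByteAtATime :: Handle -> A.Parser a -> IO a
parseByteAtATime h p = go (A.parse p BS.empty)
  where
    go (A.Partial k)    = BS.hGet h 1 >>= go . k   -- one byte per round trip
    go (A.Done _ r)     = return r
    go (A.Fail _ _ msg) = fail msg

Every header field ends up paying for a tiny read plus a trip through the continuation machinery per byte, on top of the ByteString conversions, which lines up with readMessageBegin dominating the profile above.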
I tried some modifications, such as reading more than one byte at a time (which just caused it to hang) and using tReadAll (from Transport.hs) to read the entire contents from the handle (which also hangs). I also tried a few other things, such as using Data.Binary directly, which ended up throwing exceptions along the lines of "no more data to read".
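For reference, this is the kind of direct decoder I was experimenting with. It's only a sketch based on my reading of the header layout (a version/type word, a length-prefixed UTF-8 method name, and a trailing Int32); MsgBegin and its field names are mine, not the library's:

import Control.Applicative ((<$>))
import Data.Binary.Get
import Data.Int (Int32)
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- The strict binary-protocol message header, decoded in one pass.
data MsgBegin = MsgBegin
  { mbVersion :: Int32   -- version word (includes the message type bits)
  , mbName    :: T.Text  -- method name
  , mbSeqId   :: Int32   -- trailing Int32 of the header
  }

getMessageBegin :: Get MsgBegin
getMessageBegin = do
  ver   <- fromIntegral <$> getWord32be           -- version/type word
  len   <- fromIntegral <$> getWord32be           -- name length
  name  <- TE.decodeUtf8 <$> getByteString len    -- method name bytes
  seqid <- fromIntegral <$> getWord32be           -- trailing Int32
  return (MsgBegin ver name seqid)

The decoding itself is trivial; the part I couldn't get right was feeding it, because with an unframed socket transport you only learn how many bytes to pull off the wire as you decode, which is presumably why my read-everything attempts hang waiting for input.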
Is there a fundamental reason why the Thrift code performs so poorly, and what would be a more idiomatic and efficient way to do this? Apologies for the long question; I just wanted to make sure there was enough context, and I'm really interested in making libraries like this better for Haskell.