My organisation set up a Spark Thrift server that is configured to use HTTP transport with SSL. The intent is to let Power BI retrieve data via Spark securely. However, simply retrieving schema information can take up to 10 minutes, and fetching the first 1000 rows of data takes a further 10+ minutes!
Clearly, this is unworkable, so we worked through a process of elimination. We captured a large amount of data, but I think our findings can be distilled down to:
- We ran Wireshark on the Power BI computer. This showed that Power BI was spending most of its time waiting for packets, so the client’s own processing is not the bottleneck.
- We used the Admin UI to get the exact commands that Power BI was issuing to the Spark Thrift server: the client’s commands were not efficient, but not unreasonable either.
- Beeline was used (on another machine in the same cluster) to connect and execute the exact same commands that Power BI was executing: execution was FAST.
- Simba ODBC drivers were used (on a workstation) to connect and execute a simple SELECT * command: execution was slow (1 second per row retrieved).
- We ran tcpdump on the Thrift server. This showed that most of the time was spent waiting for the Thrift server to send packets: taken together with the Wireshark capture on the client, this suggests it is not a network latency issue.
- We changed the server config to the ‘Standard’ (binary) protocol and connected with Power BI: execution was FAST!
- We reverted the server config to ‘HTTP’, but without SSL: execution was SLOW.
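For anyone wanting to reproduce the per-row timings on equal terms across clients, here is a minimal sketch of a timing harness. It is not what we actually ran: `measure_fetch` is a hypothetical helper, and it assumes only that the rows come from something iterable (for example a DB-API cursor that supports iteration). It separates time to first row from total fetch time, which is the distinction the observations above rely on.

```python
import time
from typing import Iterable, Optional, Tuple


def measure_fetch(
    rows: Iterable,
    limit: int = 1000,
    clock=time.perf_counter,
) -> Tuple[int, Optional[float], float]:
    """Consume up to `limit` rows and return
    (row_count, seconds_to_first_row, total_seconds).

    `rows` is any iterable of rows; hypothetical helper for illustration.
    `seconds_to_first_row` is None if no rows were returned.
    """
    start = clock()
    first_row_after = None
    count = 0
    for _ in rows:
        count += 1
        if first_row_after is None:
            # Time until the server delivered the first row.
            first_row_after = clock() - start
        if count >= limit:
            break
    total = clock() - start
    return count, first_row_after, total


if __name__ == "__main__":
    # Stand-in data; in a real test `rows` would be a cursor
    # after executing the same SELECT on each transport mode.
    count, first, total = measure_fetch(iter([(1,), (2,), (3,)]))
    print(f"rows={count} first_row={first:.6f}s total={total:.6f}s")
```

Running the same query through this harness in binary mode and in HTTP mode would show whether the delay is in the first response or spread evenly across every row.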
So this seems to point to a problem specifically with the HTTP transport (over port 10001), with or without SSL?
Do these findings point to any holes in my elimination process, or to obvious potential problems that we have missed?
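When repeating the Beeline test against each server configuration, the connection URL has to match the transport mode. Below is a small sketch that builds the URLs, assuming the standard HiveServer2 JDBC session parameters (`transportMode`, `httpPath`, `ssl`). The host name, the port 10000 for binary mode, and the `cliservice` path are placeholders or common defaults, not values taken from our setup, and `hive_jdbc_url` is a hypothetical helper.

```python
def hive_jdbc_url(host, port, database="default",
                  http=False, ssl=False, http_path="cliservice"):
    """Build a HiveServer2-style JDBC URL for Beeline.

    Assumptions: `http_path` defaults to the usual "cliservice" endpoint;
    a real SSL setup may also need trust store parameters, omitted here.
    """
    params = []
    if http:
        # HTTP transport needs both the mode and the endpoint path.
        params += ["transportMode=http", f"httpPath={http_path}"]
    if ssl:
        params.append("ssl=true")
    return ";".join([f"jdbc:hive2://{host}:{port}/{database}"] + params)


if __name__ == "__main__":
    host = "thrift.example.internal"  # placeholder host name
    print(hive_jdbc_url(host, 10000))                       # binary mode
    print(hive_jdbc_url(host, 10001, http=True))            # HTTP, no SSL
    print(hive_jdbc_url(host, 10001, http=True, ssl=True))  # HTTP with SSL
```

Passing each URL to `beeline -u` from the same machine would make the three configurations directly comparable.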