
I'm having an issue reading Parquet data from an SFTP server with SQLContext.

The Parquet file is quite large (6M rows).
I found some solutions to read it, but it takes almost an hour.

Below is the script that works, but it is too slow.

import pyarrow as pa  
import pyarrow.parquet as pq 
from fsspec.implementations.sftp import SFTPFileSystem
  
fs = SFTPFileSystem(host = SERVER_SFTP, port = SERVER_PORT, username = USER, password = PWD)

df = pq.read_table(f'{SERVER_LOCATION}/FILE.parquet', filesystem=fs)

When the data is not on an SFTP server, I use the code below, which usually works well even with large files. So how can I use SparkSQL to read a remote file over SFTP?

df = sqlContext.read.parquet('PATH/file')

Things that I tried: using an SFTP library to open the file, but that seems to lose all the advantages of SparkSQL.

df = sqlContext.read.parquet(sftp.open('PATH/file'))

I also tried the spark-sftp library, following this article, without success: https://www.jitsejan.com/sftp-with-spark
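For reference, the read pattern from that article looks roughly like this (a sketch only; it assumes the com.springml.spark.sftp connector is on the Spark classpath and reuses the placeholder variables from above), and it still failed for me:

# sketch of the spark-sftp pattern from the linked article (connector must be on the classpath)
df = sqlContext.read.format('com.springml.spark.sftp') \
    .option('host', SERVER_SFTP) \
    .option('port', str(SERVER_PORT)) \
    .option('username', USER) \
    .option('password', PWD) \
    .option('fileType', 'parquet') \
    .load(f'{SERVER_LOCATION}/FILE.parquet')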

2 Answers


fsspec uses Paramiko under the hood, and this is a known problem with Paramiko:
Reading file opened with Python Paramiko SFTPClient.open method is slow

In fsspec, it does not seem to be possible to change the buffer size.

But you can derive your own implementation from SFTPFileSystem that does:

class BufferedSFTPFileSystem(SFTPFileSystem):
    def open(self, path, mode='rb'):
        # open with a larger buffer than the Paramiko default
        return super().open(path, mode, bufsize=32768)
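A minimal usage sketch, assuming the same placeholder connection details as in the question: instantiate the subclass instead of SFTPFileSystem and pass it to pyarrow as before.

# hypothetical usage of the subclass above; placeholders as in the question
fs = BufferedSFTPFileSystem(host=SERVER_SFTP, port=SERVER_PORT, username=USER, password=PWD)
df = pq.read_table(f'{SERVER_LOCATION}/FILE.parquet', filesystem=fs)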
Martin Prikryl

By adding the buffer_size parameter to the pyarrow.parquet call, the run time went from 51 to 21 minutes :)

df = pq.read_table(f'{SERVER_LOCATION}/FILE.parquet', filesystem=fs, buffer_size=32768)
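Note that pq.read_table returns a pyarrow.Table rather than a pandas DataFrame; if a DataFrame is needed for further processing, it can be converted with to_pandas():

# convert the Arrow Table returned above into a pandas DataFrame
pdf = df.to_pandas()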

Thanks @Martin Prikryl for your help ;)

A2N15