
I'm having an issue reading Parquet data from an SFTP server with SQLContext.

The Parquet file is quite large (6M rows).
I found some solutions to read it, but it takes almost an hour.

Below is the script that works, but it is too slow.

import pyarrow as pa  
import pyarrow.parquet as pq 
from fsspec.implementations.sftp import SFTPFileSystem
  
fs = SFTPFileSystem(host = SERVER_SFTP, port = SERVER_PORT, username = USER, password = PWD)

df = pq.read_table(f'{SERVER_LOCATION}/FILE.parquet', filesystem=fs)

When the data is not on an SFTP server, I use the code below, which usually works well even with large files. So how can I use SparkSQL to read a remote file over SFTP?

df = sqlContext.read.parquet('PATH/file')

Things that I tried: using an SFTP library to open the file, but that seems to lose all the advantages of SparkSQL.

df = sqlContext.read.parquet(sftp.open('PATH/file'))

I also tried the spark-sftp library, following this article, without success: https://www.jitsejan.com/sftp-with-spark
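For reference, the read pattern from that article looks roughly like this (a sketch only; it assumes the com.springml.spark.sftp connector is on the Spark classpath and reuses the placeholder variables from above), and it still failed for me:

# sketch of the spark-sftp pattern from the linked article (connector must be on the classpath)
df = sqlContext.read.format('com.springml.spark.sftp') \
    .option('host', SERVER_SFTP) \
    .option('port', str(SERVER_PORT)) \
    .option('username', USER) \
    .option('password', PWD) \
    .option('fileType', 'parquet') \
    .load(f'{SERVER_LOCATION}/FILE.parquet')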

2 Answers


fsspec uses Paramiko under the hood, and this is a known problem with Paramiko:
Reading file opened with Python Paramiko SFTPClient.open method is slow

In fsspec, it does not seem to be possible to change the buffer size.

But you can derive your own implementation from SFTPFileSystem that does:

class BufferedSFTPFileSystem(SFTPFileSystem):
    def open(self, path, mode='rb'):
        # open with a larger buffer than the Paramiko default
        return super().open(path, mode, bufsize=32768)
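A minimal usage sketch, assuming the same placeholder connection details as in the question: instantiate the subclass instead of SFTPFileSystem and pass it to pyarrow as before.

# hypothetical usage of the subclass above; placeholders as in the question
fs = BufferedSFTPFileSystem(host=SERVER_SFTP, port=SERVER_PORT, username=USER, password=PWD)
df = pq.read_table(f'{SERVER_LOCATION}/FILE.parquet', filesystem=fs)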
Martin Prikryl

By adding the buffer_size parameter to the pyarrow.parquet call, the run time went from 51 to 21 minutes :)

df = pq.read_table(f'{SERVER_LOCATION}/FILE.parquet', filesystem=fs, buffer_size=32768)
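Note that pq.read_table returns a pyarrow.Table rather than a pandas DataFrame; if a DataFrame is needed for further processing, it can be converted with to_pandas():

# convert the Arrow Table returned above into a pandas DataFrame
pdf = df.to_pandas()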

Thanks @Martin Prikryl for your help ;)

A2N15