I'm having an issue reading Parquet data from an SFTP server with SQLContext.
The Parquet file is quite large (6M rows).
I found a way to read it, but it takes almost an hour.
Below is the script that works, but it is too slow.
import pyarrow as pa
import pyarrow.parquet as pq
from fsspec.implementations.sftp import SFTPFileSystem
fs = SFTPFileSystem(host=SERVER_SFTP, port=SERVER_PORT, username=USER, password=PWD)
df = pq.read_table("SERVER_LOCATION/FILE.parquet", filesystem=fs)
When the data is not on an SFTP server, I use the code below, which usually works well even with large files. So how can I use SparkSQL to read a remote file over SFTP?
df = sqlContext.read.parquet('PATH/file')
Things I tried: using an SFTP library to open the file, but this seems to lose all the advantages of SparkSQL (and `read.parquet` expects a path string, not a Python file object):
df = sqlContext.read.parquet(sftp.open('PATH/file'))
I also tried the spark-sftp library, following this article, without success: https://www.jitsejan.com/sftp-with-spark