I passed a paramiko.sftp_file.SFTPFile object instead of a file URL to pandas.read_parquet and it worked fine. But when I tried the same with Dask, it threw an error. Below are the code I ran and the error I got. How can I make this work?

import dask.dataframe as dd
import paramiko

ssh = paramiko.SSHClient()
# The host key policy has to be set before the connection is made
# (the ssh.connect() call is not shown here).
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
sftp_client = ssh.open_sftp()
source_file = sftp_client.open(str(parquet_file), 'rb')
full_df = dd.read_parquet(source_file, engine='pyarrow')
print(len(full_df))

Traceback (most recent call last):
  File "C:\Users\rrrrr\Documents\jackets_dask.py", line 22, in <module>
    full_df = dd.read_parquet(source_file,engine='pyarrow')
  File "C:\Users\rrrrr\AppData\Local\Programs\Python\Python37\lib\site-packages\dask\dataframe\io\parquet.py", line 1173, in read_parquet
    storage_options=storage_options
  File "C:\Users\rrrrr\AppData\Local\Programs\Python\Python37\lib\site-packages\dask\bytes\core.py", line 368, in get_fs_token_paths
    raise TypeError('url type not understood: %s' % urlpath)
TypeError: url type not understood: <paramiko.sftp_file.SFTPFile object at 0x0000007712D9A208>
Martin Prikryl
Rahul

2 Answers

Dask does not support file-like objects directly.

You would have to implement its "file system" interface.

I'm not sure what the minimal set of methods is that you need to implement for read_parquet to work, but you definitely have to implement open. Something like this:

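import dask.bytes.core
import dask.dataframe as dd

class SftpFileSystem(object):
    # sftp_client is an already opened paramiko SFTPClient
    def open(self, path, mode='rb', **kwargs):
        return sftp_client.open(path, mode)

# Register the class for the "sftp" protocol
dask.bytes.core._filesystems['sftp'] = SftpFileSystem

df = dd.read_parquet('sftp://remote/path/file', engine='pyarrow')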

There's actually an implementation of such a file system for SFTP in the fsspec library:
https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.implementations.sftp.SFTPFileSystem
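A minimal sketch of using that fsspec class directly, assuming fsspec, paramiko, pandas and pyarrow are installed. The function name read_remote_parquet and the host, path and credentials in the usage note are hypothetical. It opens the remote file through fsspec and passes the file object to pandas, which accepts file-like objects:

```python
def read_remote_parquet(host, path, **ssh_kwargs):
    """Read one Parquet file over SFTP into a pandas DataFrame.

    ssh_kwargs are forwarded by fsspec to paramiko's SSHClient.connect,
    for example username=..., password=..., port=...
    """
    # Imported lazily so that defining the function needs no third-party package.
    import pandas as pd
    from fsspec.implementations.sftp import SFTPFileSystem

    fs = SFTPFileSystem(host, **ssh_kwargs)
    # fs.open returns a file-like object; pandas can read from it directly.
    with fs.open(path, "rb") as remote_file:
        return pd.read_parquet(remote_file, engine="pyarrow")
```

It could be called as read_remote_parquet("example.com", "/remote/path/file.parquet", username="user", password="secret"), where all four values are placeholders.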

See also "Is it possible to read a .csv from a remote server, using Paramiko and Dask's read_csv() method in conjunction?"


Obligatory warning: Do not use AutoAddPolicy. You lose protection against MITM attacks by doing so. For a correct solution, see Paramiko "Unknown Server".
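One possible shape of a stricter setup, as a sketch only: load the already known host keys and reject any server whose key is not among them. The function name make_strict_ssh_client and the known_hosts_file parameter are hypothetical; the paramiko calls used are load_host_keys, load_system_host_keys and RejectPolicy.

```python
def make_strict_ssh_client(known_hosts_file=None):
    """Return a paramiko SSHClient that refuses servers with unknown host keys."""
    # Imported lazily so that defining the function does not require paramiko.
    import paramiko

    client = paramiko.SSHClient()
    if known_hosts_file is not None:
        # Trust only the keys listed in the given file.
        client.load_host_keys(known_hosts_file)
    else:
        # Fall back to the user's own known_hosts file.
        client.load_system_host_keys()
    # RejectPolicy raises an exception for unknown hosts instead of adding them.
    client.set_missing_host_key_policy(paramiko.RejectPolicy())
    return client
```

The returned client still has to be connected with client.connect(...) before open_sftp() is called.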

Martin Prikryl

The situation has changed, and you can now do this directly with Dask. Pasted answer from "Is it possible to read a .csv from a remote server, using Paramiko and Dask's read_csv() method in conjunction?"

In the master version of Dask, file-system operations now use fsspec, which supports some additional file systems alongside the previous implementations (s3, gcs, hdfs). For the mapping to protocol identifiers, see fsspec.registry.known_implementations.
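A small sketch for checking whether a protocol identifier is in that mapping, assuming fsspec is installed. The helper name protocol_is_known and its registry parameter are hypothetical; the parameter only exists so the check can be tried against any mapping:

```python
def protocol_is_known(protocol, registry=None):
    """Return True if the protocol identifier is a key of the registry mapping."""
    if registry is None:
        # Default to fsspec's own mapping of protocol names to implementations.
        from fsspec.registry import known_implementations as registry
    return protocol in registry
```

With fsspec installed, protocol_is_known("sftp") looks the name up in fsspec.registry.known_implementations.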

In short, a URL like "sftp://user:pw@host:port/path" should now work for you if you install fsspec and Dask from master.
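A sketch of what such a call might look like, assuming Dask, fsspec, paramiko and pyarrow are installed. The helper names sftp_url and read_parquet_over_sftp are hypothetical. Instead of embedding the credentials in the URL, this variant passes them through storage_options, which Dask hands on to the fsspec file system:

```python
def sftp_url(host, path):
    """Build an sftp:// URL from a host name and an absolute remote path."""
    if not path.startswith("/"):
        raise ValueError("path must be absolute, e.g. /data/file.parquet")
    return f"sftp://{host}{path}"


def read_parquet_over_sftp(host, path, username, password, port=22):
    """Return a Dask DataFrame backed by a Parquet file on an SFTP server."""
    # Imported lazily so that sftp_url can be used without Dask installed.
    import dask.dataframe as dd

    return dd.read_parquet(
        sftp_url(host, path),
        engine="pyarrow",
        # Forwarded to fsspec's SFTP file system and from there to paramiko.
        storage_options={"username": username, "password": password, "port": port},
    )
```

Keeping the password out of the URL also avoids problems with characters such as "@" or ":" in it.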

mdurant