
I have a PySpark pipeline that should export a table as a CSV file to HDFS and to an SFTP server (the data will be picked up by the CRM team afterwards).

Exporting to HDFS is very simple and works like a charm. But to export the data to the SFTP server I did this:

import shutil

import pysftp


def export_to_sftp():
    # Write the table as a single CSV file to HDFS.
    dataframe.coalesce(1).write.mode("overwrite") \
        .option("codec", compression) \
        .option("encoding", encoding) \
        .csv(file_to_hdfs, header=True, nullValue='', sep=';')
    copyToLocalFile(file_to_hdfs, local_machine_file)  # copy from HDFS to local using the Hadoop API

    cnopts = pysftp.CnOpts()
    hostkeys = None
    if cnopts.hostkeys.lookup(server) is None:
        # Unknown host: temporarily disable checking and cache the key below.
        hostkeys = cnopts.hostkeys
        cnopts.hostkeys = None
    try:
        with pysftp.Connection(host=server, username=login,
                               password=password, cnopts=cnopts) as sftp:
            if hostkeys is not None:
                hostkeys.add(server, sftp.remote_server_key.get_name(), sftp.remote_server_key)
                hostkeys.save(pysftp.helpers.known_hosts())
            sftp.put(local_machine_file, sftp_path)
    except Exception as e:
        log.exception(e)
    finally:
        log.info("Cleaning up")
        shutil.rmtree(local_tmp)

This method works fine when the files are not too large, but for some tables it fails because my local Linux machine does not have enough disk space.

So is it possible to use pysftp to stream a file from HDFS to the SFTP server, without copying it to the local machine first?


1 Answer


If I understand your question correctly, you are looking for something like this:

with hdfs.open("/hadoop/path/filename") as f:
    sftp.putfo(f, "/sftp/path/filename")
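
Filling in the surrounding setup, here is a minimal sketch of that idea. It assumes the HDFS side is opened with pyarrow's (now legacy) pa.hdfs.connect(), which returns a file-like object that pysftp's putfo() can read from; the namenode host/port, SFTP credentials and paths are placeholders:

import pyarrow as pa
import pysftp

# Placeholders: adjust the namenode host/port, SFTP credentials and paths.
hdfs = pa.hdfs.connect(host="namenode", port=8020)

cnopts = pysftp.CnOpts()  # verifies the server key against ~/.ssh/known_hosts

with pysftp.Connection(host="sftp_host", username="login",
                       password="password", cnopts=cnopts) as sftp:
    # Stream straight from HDFS into the SFTP upload; nothing is written locally.
    with hdfs.open("/hadoop/path/filename", "rb") as f:
        sftp.putfo(f, "/sftp/path/filename")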

Obligatory warning: Do not set cnopts.hostkeys = None, unless you do not care about security. For the correct solution see Verify host key with pysftp.
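
For illustration (not part of the original answer), a hedged sketch of the verified-host-key setup with pysftp: point CnOpts at a known_hosts file that already contains the server's key. The host name, credentials and paths below are placeholders.

import os

import pysftp

# Assumes the server's public key was added to known_hosts beforehand
# (e.g. collected with ssh-keyscan and its fingerprint checked manually).
cnopts = pysftp.CnOpts(knownhosts=os.path.expanduser("~/.ssh/known_hosts"))

with pysftp.Connection(host="sftp_host", username="login",
                       password="password", cnopts=cnopts) as sftp:
    sftp.put("local.csv", "/sftp/path/filename")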
