I have a PySpark pipeline that should export a table as a CSV file to HDFS and to an SFTP server (the data will be picked up by the CRM team afterwards).
Exporting to HDFS is very simple and works like a charm, but to export the data to the SFTP server I did this:
import shutil
import pysftp

def export_to_sftp():
    # Write the table as a single CSV file (one partition) on HDFS
    dataframe.coalesce(1).write.mode("overwrite") \
        .options(codec=compression, encoding=encoding) \
        .csv(file_to_hdfs, header=True, nullValue='', sep=';')

    # Copy from HDFS to the local machine using the Hadoop API
    copyToLocalFile(file_to_hdfs, local_machine_file)

    # If the host key is not yet known, skip host-key checking for this
    # connection and cache the key once connected
    cnopts = pysftp.CnOpts()
    hostkeys = None
    if cnopts.hostkeys.lookup(server) is None:
        hostkeys = cnopts.hostkeys
        cnopts.hostkeys = None

    try:
        with pysftp.Connection(host=server, username=login,
                               password=password, cnopts=cnopts) as sftp:
            if hostkeys is not None:
                hostkeys.add(server, sftp.remote_server_key.get_name(),
                             sftp.remote_server_key)
                hostkeys.save(pysftp.helpers.known_hosts())
            # Upload the local copy to the SFTP server
            sftp.put(local_machine_file, sftp_path)
    except Exception as e:
        log.exception(e)
    finally:
        log.info("Cleaning up")
        shutil.rmtree(local_tmp)
This method works fine when the files are not too large, but for some tables it doesn't work, because my local Linux machine doesn't have enough disk space.
So is it possible to use pysftp to copy a file from HDFS to the SFTP server as a stream, without copying it to the local machine first?
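For example, something along these lines is what I have in mind (an untested sketch, not working code: the stream_hdfs_csv_to_sftp helper, the part-file lookup and the use of pyarrow's HadoopFileSystem are all my own assumptions):

import pysftp
from pyarrow import fs

def stream_hdfs_csv_to_sftp(hdfs_dir, remote_path, server, login, password):
    # Spark writes the CSV into a directory; locate the single part file in it
    hdfs = fs.HadoopFileSystem("default")  # assumes fs.defaultFS is configured
    part_files = [info.path
                  for info in hdfs.get_file_info(fs.FileSelector(hdfs_dir))
                  if info.base_name.startswith("part-")]
    csv_path = part_files[0]

    cnopts = pysftp.CnOpts()  # the same host-key handling as above would go here
    with pysftp.Connection(host=server, username=login,
                           password=password, cnopts=cnopts) as sftp:
        # Stream the HDFS file object straight to the SFTP server,
        # without materializing it on the local disk
        with hdfs.open_input_stream(csv_path) as src:
            sftp.putfo(src, remote_path)

Would something like this work, or is there a better way to pipe an HDFS file into pysftp?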