
I want to download a CSV file from an SFTP server and process it line by line. If I use download! or sftp.file.open, it buffers the whole file in memory, which I want to avoid.

Here is my source code:

require 'net/sftp'
require 'csv'

sftp = Net::SFTP.start(@sftp_details['server_ip'], @sftp_details['server_username'], :password => decoded_pswd)
if sftp
  begin
    sftp.dir.foreach(@sftp_details['server_folder_path']) do |entry|
      print_memory_usage do
        print_time_spent do
          next unless entry.file? && entry.name.end_with?("csv")
          batch_size_cnt = 0
          entities = []
          sftp.file.open("#{@sftp_details['server_folder_path']}/#{entry.name}") do |file|
            header = file.gets
            next unless header
            header = header.encode('UTF-8', invalid: :replace, undef: :replace, replace: '')
            csv_data = ''
            while line = file.gets
              batch_size_cnt += 1
              csv_data.concat(line.encode('UTF-8', invalid: :replace, undef: :replace, replace: ''))
              if batch_size_cnt == 1000 || file.eof?
                CSV.parse(csv_data, headers: header, write_headers: true) do |row|
                  row.delete(nil) # drop cells without a header column
                  entities << row.to_hash
                end
                csv_data, batch_size_cnt = '', 0
                entities.delete_if(&:blank?)
                # DO PROCESSING PART
                entities = []
              end
            end
          end
          sftp.rename("#{@sftp_details['server_folder_path']}/#{entry.name}", "#{@sftp_details['processed_file_path']}/#{entry.name}")
        end
      end
    end
  end
end
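(The print_memory_usage and print_time_spent helpers aren't defined in the snippet above; a minimal sketch of what such helpers might look like, assumed rather than taken from the original code, could be:)

```ruby
# Assumed helper implementations (not shown in the question's code).
# print_memory_usage reports the change in resident set size around a block;
# print_time_spent reports the wall-clock time a block takes.
def print_memory_usage
  before = `ps -o rss= -p #{Process.pid}`.to_i # resident set size in kB
  result = yield
  after = `ps -o rss= -p #{Process.pid}`.to_i
  puts "Memory delta: #{after - before} kB"
  result
end

def print_time_spent
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
  puts "Time spent: #{elapsed.round(3)} s"
  result
end
```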

Can someone please help? Thanks

tukan
Kiran Kumawat
    Welcome to SO. Before we can help you we encourage some kind of effort. What did you try? – tukan Jul 06 '18 at 08:39
  • @tukan I have tried sftp.file.open that buffer 32k bytes at a time and creating batches of 1000 records to save them. – Kiran Kumawat Jul 06 '18 at 12:17
  • Well could you update your question with the actual code? Please see - https://stackoverflow.com/help/mcve for more – tukan Jul 06 '18 at 15:35
  • @tukan here is gist url https://gist.github.com/kiku1705/7c7dff4d9e3710ce3ed5802283e8dedb. Sorry for late reply. – Kiran Kumawat Jul 09 '18 at 04:07
  • useful: https://stackoverflow.com/questions/2538613/is-there-anything-in-the-ftp-protocol-like-the-http-range-header – Sergio Tulentsev Jul 09 '18 at 07:24

1 Answer


You need some kind of buffer so you can read the file in chunks and then process them together. I also think it would be wise to split your script into two steps, downloading and parsing, and focus on one thing at a time:

Your original line:

   ...
   sftp.file.open("#{@sftp_details['server_folder_path']}/#{entry.name}") do |file|
   ...

If you check the source of the download! method (don't forget the bang!), you can see it accepts any IO-like object, so you can pass a StringIO from the 'stringio' library. The stub below is easy to adjust. The default read buffer of 32kB is usually sufficient, but you can change it if you want (see the example).

Replace it with this (works only for single files):

The StringIO usage:

   ...
  io = StringIO.new
  sftp.download!("#{@sftp_details['server_folder_path']}/#{entry.name}", io, :read_size => 16000)

Or you can just download the file to disk (the block form of File.open closes the handle for you):

  ...
  File.open("/your_local_path/#{entry.name}", 'wb') do |file|
    sftp.download!("#{@sftp_details['server_folder_path']}/#{entry.name}", file, :read_size => 16000)
  end
  ...

From the docs, you can use the :read_size option:

:read_size - the maximum number of bytes to read at a time from the source. Increasing this value might improve throughput. It defaults to 32,000 bytes.
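Once the file is on disk, Ruby's CSV.foreach streams it row by row, so the whole file never has to sit in memory. A sketch of the parsing half, with the batch size and the tempfile demo being illustrative rather than from the answer:

```ruby
require 'csv'
require 'tempfile'

# Stream-parse a local CSV in batches, mirroring the 1000-row batching in
# the question. CSV.foreach reads the file lazily, one row at a time.
def process_in_batches(path, batch_size: 1000)
  batches = []
  CSV.foreach(path, headers: true).each_slice(batch_size) do |rows|
    batch = rows.map(&:to_hash)
    batch.each { |h| h.delete(nil) } # drop cells without a header column
    batches << batch                 # the "DO PROCESSING PART" would go here
  end
  batches
end

# Tiny demo against a temporary CSV file
file = Tempfile.new(['demo', '.csv'])
file.write("id,name\n1,alpha\n2,beta\n3,gamma\n")
file.close
batches = process_in_batches(file.path, batch_size: 2)
file.unlink
```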

tukan