
I want to read the lines of a file from an SFTP server. The file has more than 100,000 lines.

I am reading it in two ways:

Net::SSH.start(setting.host, setting.user,
  {
    :key_data => [ key ],
    :keys => [],
    :keys_only => true
  }
) do |ssh|
  ssh.sftp.connect do |sftp|
    sftp.dir.foreach(src_dir) do |entry|
      if entry.name.include? today
        filename = "#{src_dir}/#{entry.name}"
        sftp.file.open(filename, "r") do |f|

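          # Way 1: read every line into an array, then iterate over it
          f.readlines.each do |line|
            parse(line)
          end

          # Way 2 (an alternative to Way 1, not run after it): read one line at a time
          while line = f.gets do
            parse(line)
          end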
        end
      end
    end 
  end
end

I want to know which way uses less memory.

Remy Wang
  • "[Why is “slurping” a file not a good practice?](https://stackoverflow.com/questions/25189262)" goes into this. Reading a big file as an array can cause real problems as Ruby tries to allocate memory. – the Tin Man Dec 30 '19 at 21:02

2 Answers

What do the docs say? (Note that File is a subclass of IO. The methods #readlines and #gets are defined on IO.)

IO#readlines:

Reads all of the lines […], and returns them in an array.

IO#gets:

Reads the next “line” from the I/O stream.

Thus, I expect the latter to be better in terms of memory usage as it doesn't load the entire file into memory.
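As a rough illustration of the difference for a plain local file, here is a minimal sketch. The helper names (`build_sample_file`, `count_with_readlines`, `count_with_gets`) are made up for this example, and a `Tempfile` stands in for the real file; it says nothing about how an SFTP file object behaves.

```ruby
require 'tempfile'

# Hypothetical helper: builds a small local file to stand in for the real one.
def build_sample_file(line_count)
  file = Tempfile.new('sample')
  line_count.times { |i| file.puts("line #{i}") }
  file.flush
  file.rewind
  file
end

# Way 1: readlines builds an array holding every line before we look at any of them.
def count_with_readlines(io)
  lines = io.readlines
  lines.size
end

# Way 2: gets returns one line per call; this method never keeps more than one line.
def count_with_gets(io)
  count = 0
  count += 1 while io.gets
  count
end

sample = build_sample_file(1_000)
readlines_count = count_with_readlines(sample)
sample.rewind
gets_count = count_with_gets(sample)
sample.close! # close and delete the temporary file

puts "readlines saw #{readlines_count} lines, gets saw #{gets_count} lines"
```

Both approaches visit the same lines; the difference is only in how many of them are referenced at the same time.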

fphilipe
  • Thank you, I thought so. – Remy Wang Dec 30 '19 at 14:12
  • "as it doesn't load the entire file into memory" - this is a non-sequitur. The part you quoted does not say anything about what is loaded into memory. Only the behaviour. This is, therefore, dependent entirely on _what_ a particular IO object is. – Sergio Tulentsev Dec 30 '19 at 14:45
  • Take `open-uri`, for example. It gives you an IO object, and yet `open('http://example.com/file.txt').gets` will download the entire file before reading the first line from it. And StringIO objects hold their entire content in memory _by definition_. – Sergio Tulentsev Dec 30 '19 at 14:47
  • @SergioTulentsev, yes, you're right. By "latter" I was referring to the latter example in the question, related to `File`. – fphilipe Dec 30 '19 at 14:55
  • @fphilipe: ah, but in the question it's not a plain `File`, is it? It's whatever `Net::SFTP` gives us. – Sergio Tulentsev Dec 30 '19 at 14:57
  • @SergioTulentsev, yes, you are right :) In fact, the `file` variable is a [`Net::SFTP::Operations::File`](https://github.com/net-ssh/net-sftp/blob/master/lib/net/sftp/operations/file.rb) instance, which does not inherit from `IO`, but mimics it. – fphilipe Dec 30 '19 at 15:07
  • Digging deeper, this class will read 8192 bytes at a time as required ([source](https://github.com/net-ssh/net-sftp/blob/acbd1f7d435bb8bcc89a31b91bdfb6c8943df640/lib/net/sftp/operations/file.rb#L184)). – fphilipe Dec 30 '19 at 15:14
  • See "[What happens if you answered a question, questioner says thanks, but didn't accept your answer as correct?](https://meta.stackexchange.com/questions/109773)" and "[Do you feel dirty if you nudge new users to accept your answer when they indicate you've answered their question?](https://meta.stackexchange.com/questions/14994)" – the Tin Man Dec 30 '19 at 20:59
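To make the chunked reading mentioned in the comments above more concrete, here is a minimal sketch of a line reader that pulls a fixed number of bytes at a time from an underlying source and serves lines out of a small buffer. `ChunkedLineReader` is a hypothetical class written for this illustration; it is not the Net::SFTP code, and the 8192 default is simply the figure quoted in the comment.

```ruby
require 'stringio'

# Hypothetical illustration of buffered line reading; not the Net::SFTP implementation.
class ChunkedLineReader
  CHUNK_SIZE = 8192 # figure quoted in the comment thread; an assumption here

  # source is any object responding to read(length), e.g. a File or StringIO.
  def initialize(source, chunk_size: CHUNK_SIZE)
    @source = source
    @chunk_size = chunk_size
    @buffer = String.new # binary buffer holding bytes not yet returned
    @eof = false
  end

  # Returns the next line (including its trailing newline), or nil at end of input.
  def gets
    # Fetch more chunks only until the buffer contains a complete line.
    until @eof || @buffer.include?("\n")
      chunk = @source.read(@chunk_size)
      if chunk.nil? || chunk.empty?
        @eof = true
      else
        @buffer << chunk
      end
    end

    return nil if @buffer.empty?

    newline_at = @buffer.index("\n")
    # Cut the line out of the buffer; a final line may lack a newline.
    newline_at ? @buffer.slice!(0..newline_at) : @buffer.slice!(0..-1)
  end
end

reader = ChunkedLineReader.new(StringIO.new("first\nsecond\nthird"), chunk_size: 4)
while (line = reader.gets)
  puts line
end
```

With a reader like this, memory use is bounded by the chunk size plus the longest line, rather than by the size of the whole file.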

Generally, plain loops are faster than iterating with a block, because a block introduces a new scope.

Arrays also take up a lot of memory.

#readlines reads all of the lines in the I/O stream and returns them in an array.

#gets reads the next line from the I/O stream.

I wrote a small benchmark using a file with 1387085 lines.

I also included ::readlines, which reads the entire file specified by name and returns its lines in an array, and ::foreach, which executes the block for every line in the named I/O port.

require 'benchmark/ips'
require 'benchmark/memory'

@path = File.join(__dir__, 'file.txt')

def open_readlines
  File.open(@path, 'r') do |f|
    f.readlines.each do |line|
      line << 'www'
    end
  end
end

def open_gets
  File.open(@path, 'r') do |f|
    while line = f.gets do
      line << 'www'
    end
  end
end

def readlines
  File.readlines(@path).each do |line|
    line << 'www'
  end
end

def foreach
  File.foreach(@path) do |line|
    line << 'www'
  end
end

%i[ips memory].each do |benchmark|
  puts benchmark

  Benchmark.send(benchmark) do |x|
    x.report('::open #readlines') { open_readlines }
    x.report('::open #gets') { open_gets }
    x.report('::readlines') { readlines }
    x.report('::foreach') { foreach }

    x.compare!
  end
end

The results are:

ips
Warming up --------------------------------------
   ::open #readlines     1.000  i/100ms
        ::open #gets     1.000  i/100ms
         ::readlines     1.000  i/100ms
           ::foreach     1.000  i/100ms
Calculating -------------------------------------
   ::open #readlines      0.575  (± 0.0%) i/s -      3.000  in   5.397538s
        ::open #gets      0.746  (± 0.0%) i/s -      4.000  in   5.381583s
         ::readlines      0.570  (± 0.0%) i/s -      3.000  in   5.434956s
           ::foreach      0.826  (± 0.0%) i/s -      5.000  in   6.057936s

Comparison:
           ::foreach:        0.8 i/s
        ::open #gets:        0.7 i/s - 1.11x  slower
   ::open #readlines:        0.6 i/s - 1.44x  slower
         ::readlines:        0.6 i/s - 1.45x  slower

memory
Calculating -------------------------------------
   ::open #readlines   822.274M memsize (     8.424k retained)
                         2.774M objects (     1.000  retained)
                        50.000  strings (     0.000  retained)
        ::open #gets   810.638M memsize (     0.000  retained)
                         2.774M objects (     0.000  retained)
                        50.000  strings (     0.000  retained)
         ::readlines   822.274M memsize (     0.000  retained)
                         2.774M objects (     0.000  retained)
                        50.000  strings (     0.000  retained)
           ::foreach   810.638M memsize (     0.000  retained)
                         2.774M objects (     0.000  retained)
                        50.000  strings (     0.000  retained)

Comparison:
           ::foreach:  810638012 allocated
        ::open #gets:  810638052 allocated - 1.00x more
         ::readlines:  822274324 allocated - 1.01x more
   ::open #readlines:  822274364 allocated - 1.01x more
mechnicov