
I'm using Rails 5.2 with Ruby 2.5.1 and am deploying my app to Heroku.

I ran into problems when trying to run my rake task. The task calls an API which responds with a *.gz file, saves it, unzips it, uses the retrieved JSON to populate the database, and finally deletes the *.gz file. The task runs smoothly in development but fails when called in production: the last line printed to the console is 'Unzipping the file...', so my guess is that the issue originates in the zlib library.

companies_list.rake

require 'json'
require 'open-uri'
require 'zlib'
require 'openssl'
require 'action_view'

include ActionView::Helpers::DateHelper

desc 'Updates Company table'
task update_db: :environment do
  start = Time.now
  zip_file_url = 'https://example.com/api/download'

  TEMP_FILE_NAME = 'companies.gz'

  puts 'Creating folders...'

  tempdir = Dir.mktmpdir
  file_path = "#{tempdir}/#{TEMP_FILE_NAME}"

  puts 'Downloading the file...'

  open(file_path, 'wb') do |file|
    open(zip_file_url, { ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE }) do |uri|
      file.write(uri.read)
    end
  end

  puts 'Download complete.'
  puts 'Unzipping the file...'

  gz = Zlib::GzipReader.new(open(file_path))
  output = gz.read
  @companies_array = JSON.parse(output)

  puts 'Unzipping complete.'

  (...)
end

Has anyone else run into similar issues and figured out how to get this to work?

Joanna Gaudyn
  • Haven't worked with zlib on Heroku. Can you try running a system command, `gunzip -l file.gz`? If the file was downloaded properly and is not corrupt, you should see the file contents. Try it out from the console. – arjun Jun 13 '18 at 12:52
  • I just ran `heroku logs` and realised the problem was somewhere else: the file returned by the API exceeded our Heroku quota (Error R15: Memory quota vastly exceeded)... – Joanna Gaudyn Jun 14 '18 at 11:33
  • Is each gz-file response different? If not, you should consider caching those requests. You can also offload memory-intensive operations to other services (e.g. AWS); Heroku can just serve your app. – arjun Jun 14 '18 at 11:56
  • It's an API containing company data from an official registry, so the idea is to run the script every night to get updates (in case there are any) - so yes, in theory the responses might be different. We're looking into using a JSON streamer (with not much success so far). – Joanna Gaudyn Jun 14 '18 at 12:17

2 Answers


Your code snippet does not indicate that you ever close your GzipReader. It is often best to wrap IOs in blocks to ensure they are closed appropriately. Also, Kernel#open may not be the method you want there; instead, let GzipReader handle opening the file for you by passing file_path to Zlib::GzipReader.open, which yields the reader to the block and closes it when the block exits:

Zlib::GzipReader.open(file_path) do |gz|
  output = gz.read
  @companies_array = JSON.parse(output)
end
Sixty4Bit
  • I used [this SO](https://stackoverflow.com/questions/8684125/how-do-i-read-a-gzip-file-line-by-line) to get my gzip file unpacked and, as mentioned above, the script runs fine locally. It seems my problem is linked to memory usage (which is limited depending on your Heroku plan), so I'm looking into JSON streaming to hopefully find a workaround. – Joanna Gaudyn Jun 13 '18 at 13:57

The issue was linked to the memory limit rather than to gzip unpacking (which is why the problem only occurred in production).

The solution was to use Json::Streamer (from the json-streamer gem) so that the whole file is not loaded into memory at once.

This is the crucial part (it goes after the code posted in the question):

  puts 'Updating the Company table...'

  # `file` is an IO yielding the decompressed JSON (e.g. a Zlib::GzipReader
  # opened on the downloaded *.gz file); requires the json-streamer gem
  # (require 'json/streamer').
  streamer = Json::Streamer.parser(file_io: file, chunk_size: 1024)  # customize your chunk_size
  streamer.get(nesting_level: 1) do |company|
    # (do your stuff with the API data here...)
  end
end
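
For reference, here is a minimal sketch of how this fits together with the block-wrapped GzipReader from the other answer, feeding the reader straight into Json::Streamer so that neither the decompressed JSON nor a full @companies_array ever sits in memory at once. The Company.find_or_create_by! call and its registry_id/name attributes are hypothetical placeholders, and the sketch assumes Json::Streamer accepts any IO-like object (which Zlib::GzipReader is):

require 'zlib'
require 'json/streamer'

# file_path points at the downloaded *.gz file, as in the question.
Zlib::GzipReader.open(file_path) do |gz|
  # Pass the GzipReader itself as the IO source: json-streamer reads it
  # in chunks instead of loading the whole decompressed payload at once.
  streamer = Json::Streamer.parser(file_io: gz, chunk_size: 1024)

  # Yields each top-level array element (nesting_level: 1) as a Hash.
  streamer.get(nesting_level: 1) do |company|
    # Hypothetical persistence step: write one record at a time rather
    # than building the whole array in memory.
    Company.find_or_create_by!(registry_id: company['id']) do |c|
      c.name = company['name']
    end
  end
end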
Joanna Gaudyn