
Node.js developer here who has to work with Ruby, so I'm pretty new to a lot of concepts in Ruby and could use some help.

My use case is that I have to download very large newline delimited JSON files from S3, transform the data, and put it back to S3, all in memory without writing anything to disk.

In Node, I can do something like this:

s3DownloadStream('my-file').pipe(transformStream).pipe(backToS3Stream)

which will transform objects on the fly as they come in and put them to S3 concurrently.

I am having trouble finding a good plan of action to achieve this same behavior in Ruby. I have seen IO.pipe and Celluloid::IO as possible options, but neither seems quite able to do this.

aloisbarreras
  • Maybe this will help: https://aws.amazon.com/blogs/developer/downloading-objects-from-amazon-s3-using-the-aws-sdk-for-ruby/ – Alexandre Angelim Feb 17 '17 at 02:25
  • @AlexandreAngelim I saw that post, but it seems like that is either for downloading a large file to disk or into an in-memory IO. I didn't see anything in that post about being able to pipe the download through a transform and simultaneously back to S3. I imagine I'm going to have to use fork or Thread.new, but I am hoping to get a real-world example of someone doing something similar that I can build off of. – aloisbarreras Feb 17 '17 at 02:54
  • The above link gets you most of the way there. Look at the code under "Using Blocks." Instead of writing each chunk to a file, process the chunk however you want and then upload the result to S3 (using, I would assume, the multipart upload API). – Jordan Running Feb 17 '17 at 16:46
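For reference, here is a rough sketch of the chunked approach described in the last comment, using the aws-sdk-s3 gem. The bucket and key names and the transform helper are placeholders, and a real version would also have to deal with chunks that split a JSON line:

require 'aws-sdk-s3'

s3 = Aws::S3::Client.new
upload = s3.create_multipart_upload(bucket: 'my-bucket', key: 'transformed-file')

parts = []
buffer = +''
part_number = 1

# Every part except the last must be at least 5 MB.
flush = lambda do
  part = s3.upload_part(bucket: 'my-bucket', key: 'transformed-file',
                        upload_id: upload.upload_id,
                        part_number: part_number, body: buffer)
  parts << { etag: part.etag, part_number: part_number }
  part_number += 1
  buffer = +''
end

# get_object with a block streams the object in chunks rather than loading
# the whole thing at once. Chunk boundaries won't line up with newlines, so
# a real transform would need to buffer partial lines.
s3.get_object(bucket: 'my-bucket', key: 'my-file') do |chunk|
  buffer << transform(chunk)   # hypothetical transform of each chunk
  flush.call if buffer.bytesize >= 5 * 1024 * 1024
end
flush.call unless buffer.empty?

s3.complete_multipart_upload(bucket: 'my-bucket', key: 'transformed-file',
                             upload_id: upload.upload_id,
                             multipart_upload: { parts: parts })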

1 Answer


Ruby doesn't have a direct analogue to Node streams, but it does have the Enumerable iterator framework, and through that, lazy enumerators. A lazy enumerator only emits data as needed, unlike a regular enumerator chain, where each stage runs to completion before the next one starts.

If you set up a lazy chain it will evaluate bit by bit, not all at once.

So your code will look like:

# s3_download stands in for whatever yields the object's lines as they arrive
s3_download('my-file').lazy.map do |line|
  # transform each line as it streams in
end.each do |transformed|
  # pipe the transformed data back to S3
end

Here's a trivial example you can build on:

input = ('a'..'z')

input.lazy.map do |i|
  puts 'i=%s' % i

  i.upcase
end.each do |j|
  puts '  j=%s' % j
end

You can see each value ripple through the chain individually: the output interleaves i=a, j=A, i=b, j=B, and so on. If you remove lazy, that's not the case: the first loop runs to completion, buffering its results into an array, and only then does the second loop process that array, also to completion.
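
Applied to your use case, the same shape works on anything that can be read line by line. Here's a runnable sketch where a StringIO of newline-delimited JSON stands in for the S3 download and print stands in for the upload:

require 'json'
require 'stringio'

# Stand-in for the S3 download: any IO that can yield lines works here.
ndjson = StringIO.new(%({"n":1}\n{"n":2}\n{"n":3}\n))

ndjson.each_line.lazy.map do |line|
  record = JSON.parse(line)
  record['n'] *= 10                  # transform each record as it arrives
  JSON.generate(record) << "\n"
end.each do |out_line|
  print out_line                     # stand-in for uploading each result
end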

Node streams are a lot more complicated than this: they can pause and resume, defer an operation without blocking, and more, so there's only so much overlap in functionality. Ruby can get there if you spend the time on things like fibers and threads, but that's a lot of work.
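
For instance, a bare-bones way to overlap downloading and uploading is a thread feeding a bounded queue. This is only a sketch; download_chunks, transform, and upload_part are hypothetical helpers wrapping the S3 calls:

queue = SizedQueue.new(10)   # bounded so the downloader can't race far ahead

producer = Thread.new do
  download_chunks('my-file') do |chunk|   # hypothetical: yields raw chunks
    queue << chunk
  end
  queue << :done
end

while (chunk = queue.pop) != :done
  upload_part(transform(chunk))           # hypothetical transform + upload
end

producer.join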

tadman