
I have a CSV file in AWS S3 that I'm trying to open in a local temp file. This is the code:

require 'aws-sdk-s3'
require 'tempfile'

s3 = Aws::S3::Resource.new
bucket = s3.bucket({bucket name})
obj = bucket.object({object key})
temp = Tempfile.new('temp.csv')
obj.get(response_target: temp)

It pulls the file from AWS and loads it into a new temp file (Tempfile uses 'temp.csv' as the name prefix). For some files, the obj.get(...) line throws the following error:

WARN: Encoding::UndefinedConversionError: "\xEF" from ASCII-8BIT to UTF-8
WARN: /Users/.rbenv/versions/2.5.0/lib/ruby/2.5.0/delegate.rb:349:in `write'
/Users/.rbenv/versions/2.5.0/lib/ruby/2.5.0/delegate.rb:349:in `block in delegating_block'
/Users/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/aws-sdk-core-3.21.2/lib/seahorse/client/http/response.rb:62:in `signal_data'
/Users/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/aws-sdk-core-3.21.2/lib/seahorse/client/net_http/handler.rb:83:in `block (3 levels) in transmit'
...
/Users/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/aws-sdk-s3-1.13.0/lib/aws-sdk-s3/client.rb:2666:in `get_object'
/Users/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/aws-sdk-s3-1.13.0/lib/aws-sdk-s3/object.rb:657:in `get'

The stack trace shows the error is initially thrown by the .get call inside the AWS SDK for Ruby.

Things I've tried:

When uploading the file (object) to AWS S3, you can specify content_encoding, so I tried setting that to UTF-8:

obj.upload_file({file path}, content_encoding: 'utf-8')

Also when you call .get you can set response_content_encoding:

obj.get(response_target: temp, response_content_encoding: 'utf-8')

Neither of those works; both result in the same error as above. I would really expect that to do the trick. In the AWS S3 dashboard I can see that the content encoding is indeed set correctly via the code, but it doesn't appear to make a difference.

It does work when I change the Tempfile line in the first code snippet above to the following:

temp = Tempfile.new('temp.csv', encoding: 'ascii-8bit')

But I'd prefer to upload and/or download the file from AWS S3 with the proper encoding. Can someone explain why specifying the encoding on the tempfile works? Or how to make it work through the AWS S3 upload/download?

Important to note: the problematic character in the error message appears to be just a random symbol added at the beginning of this auto-generated file I'm working with. I'm not worried about reading the character correctly; it gets ignored when I parse the file anyway.

kaydanzie
  • The problematic character is a Byte Order Mark (BOM), which I believe is not compatible with UTF-8. You'll want to do some searches on it. – dmulter Aug 08 '18 at 21:47
  • Where is the error coming from? What's the full stack trace? – Casper Aug 08 '18 at 21:58
  • @Casper the majority of the stack trace I feel is not relevant and contains some personal info that would take a lot of time to remove. I've added the top and bottom of the stack trace and clarified the line in the code that throws the error. I'm pretty confident I've narrowed down where the issue is. – kaydanzie Aug 08 '18 at 22:13
  • Ok I was able to replicate this problem by creating an ascii-8bit string containing BOM and trying to write it into a Tempfile. The question is this: why is the AWS gem treating the data internally as ascii-8bit. – Casper Aug 08 '18 at 22:46

3 Answers


I don't have a full answer to your whole question, but I think I have a generalized solution, and that is to always put the temp file into binary mode. That way the AWS gem will simply dump the data from the bucket into the file, without any further re-encoding:

Step 1 (put the Tempfile into binmode):

temp = Tempfile.new('temp.csv')
temp.binmode
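
(binmode switches the Tempfile's stream into binary mode, i.e. ASCII-8BIT, so the ASCII-8BIT chunks coming from the SDK are written byte-for-byte rather than being converted to UTF-8.)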

You will, however, still have a problem: there is now a 3-byte BOM header in your UTF-8 file.

I don't know where this BOM came from. Was it there when the file was uploaded? If so, it might be a good idea to strip the 3-byte BOM before uploading.
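
If you do want to strip it before uploading, a minimal sketch (the file and bucket/key names here are hypothetical) could look like this, using the same "bom|utf-8" trick described below to drop the BOM on read:

require "aws-sdk-s3"

obj = Aws::S3::Resource.new.bucket("my-bucket").object("my-key.csv") # hypothetical bucket/key

content = File.read("source.csv", encoding: "bom|utf-8") # a leading BOM, if any, is dropped here
File.write("clean.csv", content)                         # re-written without the BOM
obj.upload_file("clean.csv")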

However, if you set up your system as below, it will not matter, because Ruby supports transparent reading of UTF-8 with or without a BOM, and will return the string correctly regardless of whether the BOM header is in the file:

Step 2 (process the file using bom|utf-8):

File.read(temp.path, encoding: "bom|utf-8")
# or...
CSV.read(temp.path, encoding: "bom|utf-8")

This should cover all your bases I think. Whether you receive files encoded as BOM + UTF-8 or plain UTF-8, you will process them correctly this way, without any extra header characters appearing in the final string, and without errors when saving them with AWS.
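
Putting the two steps together, a minimal end-to-end sketch (bucket and key names are placeholders) might look like:

require "aws-sdk-s3"
require "tempfile"
require "csv"

obj = Aws::S3::Resource.new.bucket("my-bucket").object("my-key.csv") # placeholders

temp = Tempfile.new("temp.csv")
temp.binmode                   # Step 1: write raw bytes, no transcoding
obj.get(response_target: temp)
temp.close                     # flush to disk; the file itself remains until unlinked

rows = CSV.read(temp.path, encoding: "bom|utf-8") # Step 2: a leading BOM is skipped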

Another option (from OP)

Use obj.get.body instead, which will bypass the whole issue with response_target and Tempfile.
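
For completeness, a minimal sketch of that approach (bucket and key are again placeholders), relying on the comments below which report that the body comes back as UTF-8:

require "aws-sdk-s3"
require "csv"

obj  = Aws::S3::Resource.new.bucket("my-bucket").object("my-key.csv")
body = obj.get.body.read                       # the whole object as a string
rows = CSV.parse(body.delete_prefix("\uFEFF")) # drop a leading BOM if present (Ruby 2.5+)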

Useful references:
Is there a way to remove the BOM from a UTF-8 encoded file?
How to avoid tripping over UTF-8 BOM when reading files
What's the difference between UTF-8 and UTF-8 without BOM?
How to write BOM marker to a file in Ruby

kaydanzie
Casper
  • The solution to use File.read or CSV.read does not appear to solve the issue because there is still the problem of getting the file from S3. Or are you suggesting to pass the S3 path to File.read or CSV.read? That could be a solution. I've decided that the best way to go about it is to just avoid the process of putting it into a local file from S3 and instead just parse the contents directly from S3. I can do `obj.get.body` to return the contents instead of `obj.get(response_target: ...)` – kaydanzie Aug 09 '18 at 06:39
  • The solution to getting the file is the first step: put the Tempfile into binmode. This just disables any encoding or processing by Ruby, and you will get the raw data to disk. Because the error you're getting comes from the AWS gem internally, when it does a `write` into the Tempfile. Now a good question is why is the AWS gem tagging your data as ASCII-8BIT internally. I think it's even worth opening a bug report on Github for that. They may have an explanation for that behavior (or maybe it is even a bug). – Casper Aug 09 '18 at 12:54
  • And btw. `obj.get.body` sounds fine too. You can save it manually to a file or do whatever is needed with that. I would however be curious to know in what encoding this is `obj.get.body.encoding`? Anyway, it seems you have enough options to choose from now. – Casper Aug 09 '18 at 13:06
  • The encoding for obj.get.body says UTF-8. Thank you for the advice, I'll look into possibly opening an issue with the AWS SDK gem. I'll mark this as the accepted answer because I think it could help people with the particular question I had. – kaydanzie Aug 09 '18 at 15:23
  • Thanks. The fact that it returns UTF-8 makes it even more strange why the gem is treating it as ASCII-8BIT internally. Btw. one more solution is that `response_target` can be a file name directly. Perhaps you knew this, but if you generate your own unique file name, and provide that as `response_target` then the gem will dump the data into that file. Could work better than Tempfile if you just want to get the data into a file. – Casper Aug 09 '18 at 15:54
  • Thank you! Very much!! – Megapiharb May 21 '20 at 20:26

I fixed this encoding issue by additionally using File.open(tmp, 'wb'). Here is how it looks:

s3_object = Aws::S3::Resource.new.bucket("bucket-name").object("resource-key")

Tempfile.new.tap do |file|
  s3_object.get(response_target: File.open(file, "wb"))
end
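
Presumably this works for the same reason as the binmode approach above: File.open(file, "wb") opens the Tempfile's underlying path in binary write mode, so the bytes the SDK writes into it are not transcoded.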
Fei

The Ruby SDK docs have an example of downloading an S3 item to the filesystem at https://docs.aws.amazon.com/sdk-for-ruby/v3/developer-guide/s3-example-get-bucket-item.html. I just ran it and it works fine.
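
For reference, that example boils down to roughly the following (region, bucket, and key are placeholders):

require "aws-sdk-s3"

s3 = Aws::S3::Client.new(region: "us-west-2")
s3.get_object(
  response_target: "./local-copy.csv", # stream the object straight to a local path
  bucket: "my-bucket",
  key: "my-key.csv"
)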

Doug Schwartz
  • Yes it does. However, in the example in the question I asked, you will see the exact scenario where the example you linked to does not work. Regardless of specifying encoding when uploading or downloading a file to S3, the S3 gem cannot properly pull a file with this BOM at the beginning into a temp file without specifying the encoding on the temp file. – kaydanzie Aug 13 '18 at 18:53