
Rails 4 - Ruby 2.2.2 - Amazon AWS S3 - dragonfly 1.0.12 - dragonfly-s3_data_store 1.2 - fog-aws 0.10.0

We have no issues roughly 99% of the time. The error usually occurs when usage is high, but I have also seen it happen when there were almost no users. The line that throws the error:

 # excon/lib/excon/socket.rb
 # line 100 inside the connection method.
 addrinfo = ::Socket.getaddrinfo(*args)

The error happens everywhere in the application. Sometimes it appears even when no remote connection is being made, though I am no longer able to verify this.

I used Rails loggers to capture the arguments being passed in and there is seemingly no difference between a pass and a fail. Here are some examples:

 # PASS
 ["s3.amazonaws.com", 443, 0, 1, nil, nil, false]
 ["mybucket.s3.amazonaws.com", 443, 0, 1, nil, nil, false]

 # FAIL
 ["mybucket.s3-us-west-1.amazonaws.com", 443, 0, 1, nil, nil, false]
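To make the logged arrays easier to compare, here is a minimal sketch that labels each position with the parameter name from Ruby's documented `Socket.getaddrinfo(nodename, servname, family, socktype, protocol, flags, reverse_lookup)` signature. The helper name `label_getaddrinfo_args` and the constant `GETADDRINFO_PARAMS` are hypothetical and are not part of excon. It assumes the logged array is exactly what gets splatted into `getaddrinfo`, as the line from `socket.rb` suggests.

```ruby
require 'socket'

# Positional parameters of Socket.getaddrinfo, in order.
GETADDRINFO_PARAMS = %i[
  nodename servname family socktype protocol flags reverse_lookup
].freeze

# Hypothetical helper: pair each logged argument with its parameter name.
def label_getaddrinfo_args(args)
  GETADDRINFO_PARAMS.zip(args).to_h
end

failing = ["mybucket.s3-us-west-1.amazonaws.com", 443, 0, 1, nil, nil, false]
labeled = label_getaddrinfo_args(failing)
```

Read this way, the passing and failing calls differ only in `nodename`. On Linux the `family` value 0 and `socktype` value 1 correspond to `Socket::AF_UNSPEC` and `Socket::SOCK_STREAM`; other platforms may number these constants differently.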

I came across several forum posts that led me to believe the excon gem needed an update. I upgraded excon from 0.45.4 to 0.51.0, and I also updated the fog gem from 1.36.0 to 1.38.0.

After upgrading the error went from "getaddrinfo: Name or service not known (SocketError)" to "Excon::Error::Socket: getaddrinfo: No address associated with hostname (SocketError)"

The URL captured for a failed request differs from the URLs that pass. I will look into this further.

UPDATE:

The dragonfly initializer specifies the same host as the one that fails, and because url_host overrides the default behavior, I decided to remove it.

 # myapp/config/initializers/dragonfly.rb
 ...
 url_host: 'mybucket.s3-us-west-1.amazonaws.com'

This resulted in no change. The same url is still used and is the only one that fails.

Peter Black
  • Could you share some of the pass/fail arguments for reference? Thanks. – geemus Aug 03 '16 at 17:34
  • The loggers were taken out when we upgraded the gem. I will add successful args now but I will not be able to provide a failed args list until tomorrow. – Peter Black Aug 03 '16 at 18:54
  • It would seem as if the url is getting "-us-west-1" appended to it. This may be the cause of my woes. – Peter Black Aug 04 '16 at 15:35
  • Hmm. It might be failed redirect following (which occurs when the connection and bucket are in different regions). Some of that can be a bit wonky at times. – geemus Aug 04 '16 at 19:52
  • @geemus do you have any advice for dealing with and/or debugging this issue? – Peter Black Aug 15 '16 at 14:47
  • If you have only that URL failing, maybe you have your region written wrong somewhere, [like here](http://stackoverflow.com/questions/22588089/socketerror-at-getaddrinfo-nodename-nor-servname-provided-or-not-known-padri)? Or if it fails from time to time, then I'd say that it's an excon problem since `getaddrinfo()` is about DNS and it might fail for various reasons (so maybe there is a need to retry). – Roman Khimov Aug 15 '16 at 18:55
  • How frequently is it failing? If it is a long running process, it might also be a caching issue? (ie DNS is correct on initial connection, but changes later). I'm not sure that would be very likely, but perhaps. I suppose it could also signal a broader networking error, but I would imagine that would show up more dramatically (and less regularly). Is it possible that some of the objects would be in a different region? This might also lead to issues. – geemus Aug 16 '16 at 19:00
  • We encounter this error fairly infrequently for the amount of things that go through excon. I would estimate we see the error 3 - 20 times a day. We have 1000s of users. We only have one bucket. I will contact Amazon and get more information about how our bucket is hosted. – Peter Black Aug 16 '16 at 22:12
  • Yeah, afraid I haven't heard of other cases like this so I don't readily have other advice. On some level, networks are not to be trusted, so perhaps retries will be sufficient. Still, I would expect this to be hit more broadly if it were a general issue with S3 (and it is not being hit broadly to the best of my knowledge). – geemus Aug 25 '16 at 18:49
  • Is your application running on an EC2 instance? What operating system? If Linux, what's the output of `sysctl net.ipv4.ip_local_port_range`? – mwp Sep 06 '16 at 07:39
  • its s3. net.ipv4.ip_local_port_range = 32768 61000 – Peter Black Sep 28 '16 at 14:36
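The retry idea raised in the comments could be sketched roughly as below. This is only an illustration under stated assumptions, not how excon handles lookups: the method name `getaddrinfo_with_retry` and its `attempts`, `base_delay`, `resolver`, and `sleeper` parameters are all hypothetical. The resolver is injectable so the behavior can be exercised without touching real DNS.

```ruby
require 'socket'

# Hypothetical wrapper: retry a DNS lookup a few times with exponential
# backoff, re-raising the SocketError once the attempts are used up.
def getaddrinfo_with_retry(args,
                           attempts: 3,
                           base_delay: 0.1,
                           resolver: ::Socket.method(:getaddrinfo),
                           sleeper: Kernel.method(:sleep))
  tries = 0
  begin
    tries += 1
    resolver.call(*args)
  rescue SocketError
    raise if tries >= attempts
    # Wait base_delay, then 2x, 4x, ... before the next attempt.
    sleeper.call(base_delay * (2**(tries - 1)))
    retry
  end
end
```

With the defaults, a call such as `getaddrinfo_with_retry(["mybucket.s3-us-west-1.amazonaws.com", 443, 0, 1, nil, nil, false])` would try the lookup up to three times. Whether that is enough depends on why the lookup fails; a retry hides a transient resolver hiccup but not a wrong hostname.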

2 Answers


I had this error, too. In my case, the culprit was either the server load (a slow file upload) or special characters in the filename. Since you also see this during low usage times, you might want to look at the filenames that people upload. For me the error typically occurred when someone uploaded a file with German umlauts (ä, ö, ü, ß) in the filename.

So please try to upload a file with some special characters in the name and tell us whether this reliably reproduces the error.

If this is the case, then simply escape the special characters or name the file differently. Here is a description of the special characters issue: https://github.com/markevans/dragonfly-s3_data_store/issues/6.
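As one way of "naming the file differently", here is a minimal sketch that rewrites a filename to plain ASCII before it is stored. The helper `ascii_safe_filename` and the `UMLAUTS` table are hypothetical; they are not part of dragonfly or dragonfly-s3_data_store, and where to hook such a helper into an upload flow is left open.

```ruby
# Hypothetical mapping for the German characters mentioned above.
UMLAUTS = {
  'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue',
  'Ä' => 'Ae', 'Ö' => 'Oe', 'Ü' => 'Ue',
  'ß' => 'ss'
}.freeze

# Transliterate umlauts, then replace every remaining character outside
# a conservative ASCII set (letters, digits, dot, underscore, hyphen)
# with an underscore.
def ascii_safe_filename(name)
  transliterated = name.gsub(/[äöüÄÖÜß]/, UMLAUTS)
  transliterated.gsub(/[^A-Za-z0-9._-]/, '_')
end
```

Transliterating rather than percent-escaping keeps the stored name readable, at the cost of not being reversible.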

morgler
  • We are seeing the error on simple get requests as well as file upload/download. Sometimes the error is simply: SocketError: getaddrinfo: Name or service not known without the "Exconn" portion as stated above. – Peter Black Jan 05 '17 at 14:32
  • Ok, if you're getting the error on simple GET requests, my explanation does not fit your case. I have no idea beyond what I described from my own experience above. Sorry :( – morgler Jan 07 '17 at 13:35

It might not solve your problem, but I have seen something like this in two cases:

  1. Firewall restricted the port my system was configured to.
  2. My authorization/authentication credentials were wrong/outdated.
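The first case can be checked outside of excon with a plain TCP connect. The sketch below is only a diagnostic aid: the method name `probe_tcp` and its return symbols are hypothetical. It separates a failed name lookup (`SocketError`) from a host that resolves but cannot be reached on the given port.

```ruby
require 'socket'

# Hypothetical probe: try to open a TCP connection and report what happened.
#   :open         the connection succeeded
#   :dns_failure  the hostname could not be resolved (SocketError)
#   :unreachable  the host resolved, but the connect was refused or timed out
def probe_tcp(host, port, timeout: 5)
  Socket.tcp(host, port, connect_timeout: timeout) { :open }
rescue SocketError
  :dns_failure
rescue SystemCallError
  :unreachable
end
```

For example, `probe_tcp("mybucket.s3-us-west-1.amazonaws.com", 443)` returning `:dns_failure` would point at name resolution rather than a blocked port.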
Rais