Very very strange issue here... Apologies in advance for the wall of text.

We have a suite of applications running on an EC2 instance, all connecting to an RDS instance.

We are hosting the staging and production applications on the same EC2 server.

With one of the applications, as soon as the staging build is moved to prod, 250 or so connections to the DB are opened, causing the RDS instance to max out its CPU and slowing down the entire suite. The staging application itself does not have this issue.

The issue can be replicated both by deploying the app via our Octopus setup and by manually copying the BIN/Views folders from staging to live.

The connections open almost instantly, pushing CPU usage to 99% in less than a minute.

Things to note...

Running the query from "How to see active SQL Server connections?" shows the bulk connections, none of which have a LoginName.

Resource Monitor on the FE server lists the connections, all coming from IIS, which appears to be cycling through its outbound ports while attempting to connect to the DB server on its port. The FE and DB server addresses are blacked out in the screenshot, which shows only a snippet of the connections. (Screenshot: Resource Monitor example)

The app needs users to log in to perform 99.9% of tasks. There is a public "Forgot your password" method that was updated to accept either a username or password. No change to the form structure or form action URL, just an extra check in the back.

Other changes were around how data is displayed and around payment restrictions under certain conditions, both of which require a login.

Things I've tried...

  • New app pools
  • Just giving it a few days to forget this ever happened
  • Not using Octopus to publish
  • Checking all areas that were updated between versions to see if a connection was not closed properly.

Really at a loss as to what is happening. This is the first time that I've seen something like this. Especially strange that staging is fine, but the same app on another URL/Connection string fails so badly.

The only thing I can think of is some kind of scraper polling the public form, but that makes no sense: why isn't it happening with the current app...

Is there something in AWS that can monitor the calls that are being made? I vaguely remember something in NewRelic being able to do so.

Any suggestions and/or similar experiences are welcomed.

Edits.

  • Nothing outstanding in logs for the day of the issue (yesterday)
  • No incoming traffic to match all of the outbound requests
  • No initialisation is performed by the application on startup

Update...

We use ADO for most of our queries. A query was updated to get data from different tables. The method name and parameters were not changed, just the body of the query. If I use sys.dm_exec_sql_text to see what is being sent to the DB, I can see that it IS the updated query that is being sent in each of the hundreds of connections. They are all showing as suspended though... Nothing has changed in how that query is sent to the server, just the body of the query itself...
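To see why those requests sit in the suspended state, it may help to look at what each one is waiting on. The sketch below is only an illustration: the T-SQL string joins sys.dm_exec_requests to sys.dm_exec_sql_text and would be run against the server separately (for example from SSMS), while `summarize_waits` and the sample rows are hypothetical names and made-up data used to show how the result could be grouped.

```python
from collections import Counter

# T-SQL to run against the SQL Server instance; it is not executed by this script.
SUSPENDED_REQUESTS_SQL = """
SELECT r.session_id, r.status, r.wait_type, r.wait_time,
       r.blocking_session_id, t.text AS sql_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.status = 'suspended';
"""


def summarize_waits(rows):
    """Count suspended requests per (wait_type, blocking_session_id) pair."""
    return Counter((row["wait_type"], row["blocking_session_id"]) for row in rows)


# Made-up rows shaped like the query result, for illustration only.
sample_rows = [
    {"session_id": 71, "wait_type": "LCK_M_S", "blocking_session_id": 60},
    {"session_id": 72, "wait_type": "LCK_M_S", "blocking_session_id": 60},
    {"session_id": 73, "wait_type": "PAGEIOLATCH_SH", "blocking_session_id": 0},
]

for (wait_type, blocker), count in summarize_waits(sample_rows).most_common():
    print(f"{count} request(s) waiting on {wait_type}, blocked by session {blocker}")
```

If most requests share one wait type or one blocking session, that would point at locking or I/O caused by the new query body rather than at the way the query is sent.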

  • Check your IIS W3C logs. Do you have a ton of inbound traffic corresponding to these requests? Or pretty quiet? Do you see anything suspicious, like redirect loops? – John Wu Mar 15 '17 at 00:24
  • Running through the logs now. There doesn't seem to be any inbound traffic. DB connections are low, 15 for example, when we switch to test the issue again. Will check for redirect loops... – ryanjhilton Mar 15 '17 at 00:37
  • So maybe you have a startup issue. Does the application access the database as part of its initialization routine (e.g. to obtain settings)? I'm guessing it is falling over and retrying again and again. You may have some sort of environment-specific configuration, or perhaps permissions, that are out of whack. – John Wu Mar 15 '17 at 00:49
  • Thanks John. There is no initialisation that is performed. Even so, it would fail on the staging DB too? Both DBs were created equally. – ryanjhilton Mar 15 '17 at 00:56

1 Answer

So, one of the other queries that was published in the update broke it. We reverted only that query and deployed a new version, and it is fine.

Strangely enough, it's one that runs in one form or another across the entire suite, which is why I assumed it would be the last place to look. It simply died under any load heavier than staging's.
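One possible way to picture why the same query could look fine on staging and still pile up connections under heavier use is Little's law: the average number of in-flight requests is roughly the arrival rate multiplied by how long each request takes. This is a rough sketch, not a diagnosis of what actually happened here; the function name and all the numbers are invented for illustration.

```python
def estimated_open_connections(requests_per_second, query_seconds):
    """Little's law: average in-flight requests = arrival rate * time per request."""
    return requests_per_second * query_seconds


# Invented figures: light staging traffic with a fast query,
# versus heavier traffic with a query that has become slow.
staging = estimated_open_connections(requests_per_second=1, query_seconds=0.25)
production = estimated_open_connections(requests_per_second=20, query_seconds=10)

print(f"staging:    ~{staging} connections in use on average")
print(f"production: ~{production} connections in use on average")
```

Under these assumptions a query that slows down with data volume or contention multiplies the number of open connections, even when the calling code is unchanged.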