
My website can handle 40,000 or more people simultaneously and stay fast, but a search engine bot will kill MySQL. It's been driving me insane: once the bots come, the site shows "Could not connect: too many connections" and I have to manually restart mysqld to get the website back up. I have been tackling this for a year and have made countless adjustments to Apache and MySQL tuning, but nothing seems to work. I have changed max_connections from 300 to 1800 to 10000 and that does not fix the bot problem.

I use Amazon Linux and have a huge instance/server, so RAM is not an issue. I have done countless tech support sessions and they never find anything wrong. So I have to assume it has something to do with my programming. I don't use WordPress; I built my site from scratch, but like I said, it can handle 40,000 people no problem. Bots, though, crash it.

my connect script is simple:

$connect = mysql_connect("localhost", "user", "password");
if (!$connect) {
    die('Could not connect: ' . mysql_error());
}
mysql_select_db("db", $connect);

The odd thing is, there is always "1" for current connections even when there are 2,000 people on the site. That is why I feel like I'm doing something wrong with connecting to the database.

Does anyone have experience or advice on keeping a site running at all times under heavy bot traffic? PLEASE!!! I repeat, this is not an "increase max_connections" issue.

Chris Filippou
  • 2 questions: 1) do you keep the connection open for the entire user session? or is it closed and opened more than once per session. 2) when you have 40K people - you mean SIMULTANEOUSLY? and you still see 1 db connection?! – carmel Feb 26 '17 at 19:41
  • I run the connect script, then after the select queries I close it (even though it closes on its own). Yeah I can have 40k simultaneously and it will show 1 connection in ssh show processlist. And I know that is strange. So that is why I feel like I'm doing something wrong. But the database can handle all those people fine! Just not bots and I am so tired of this!! – Chris Filippou Feb 26 '17 at 19:45
  • Presumably the bots hit your web site, not your mysqld directly. Does anything in particular characterize your bot traffic? Can you tell it's bots from the user-agent strings, or if not, how do you tell? Have you tried, when you detect bots, responding with 503 (temporary overload) status? Does that resolve the problem? Have you tried upgrading your code from the notoriously unreliable and deprecated `mysql_` interface to `PDO` or `mysqli_`? – O. Jones Feb 26 '17 at 19:46
  • 1. [The mysql_ functions have been removed from PHP](http://stackoverflow.com/questions/12859942/why-shouldnt-i-use-mysql-functions-in-php) 2. Have you tried monitoring your DB? MySQL Workbench can run a constant `SHOW PROCESSLIST;` and show you what queries are being run 3. Have you considered moving the DB off the server? Maybe into an RDS instance – Machavity Feb 26 '17 at 19:46
  • I have no method of detecting bots and responding with 503. I can tell it's bots from my access log: I'll do a tail and see a bunch of bots, then run the top IPs from the access log through the host command and see the bot. I think I get a lot of spam bots too and have blocked many of those in .htaccess, but I need to allow Googlebot and the like. I have thought about RDS too, just haven't done it yet. – Chris Filippou Feb 26 '17 at 19:54
  • Have you something like fail2ban on your server? You could whitelist the bots you accept and ban those you don't want/know, it would help your server I think. – Veve Feb 28 '17 at 09:31

1 Answer


MySQL is accepting new connections but can't keep up with all the queries, so the number of waiting connections just piles up until there are too many.
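
You can watch this happen while the bots are hitting the site (as suggested in the comments) with a few standard MySQL statements:

-- How many connections are open right now, and what the configured limit is
SHOW GLOBAL STATUS LIKE 'Threads_connected';
SHOW VARIABLES LIKE 'max_connections';

-- What each connection is doing; look for many queries stuck waiting
SHOW FULL PROCESSLIST;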

The problem isn't really with MySQL; it's the bots that are misbehaving. You probably don't need all those bots scanning your whole site every time, and luckily you have some control over them.

Step 1: Create a robots.txt and disallow all bots except the ones you care about (an empty Disallow: allows everything; Disallow: / blocks the whole site).

User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: bingbot
Disallow:

User-agent: *
Disallow: /

Step 2: Create a sitemap. Setting the last-modified time of each page means the bots will only hit the pages that have changed on your site. You can create the sitemap dynamically (querying your DB) using the PHP library thepixeldeveloper/sitemap.

In the example, we assume that you have a database with a pages table that has permalink and last_modified columns.

// sitemap.php
// Assumes the DB connection from your connect script is already open.

$urlSet = new Thepixeldeveloper\Sitemap\Urlset();

// Add the URL for '/' to the XML map
$homeUrl = (new Thepixeldeveloper\Sitemap\Url('/'))
  ->setChangeFreq('daily')
  ->setPriority(1.0);

$urlSet->addUrl($homeUrl);

// Add the URL of each page to the sitemap
$result = mysql_query("SELECT permalink, last_modified FROM pages");

while ($page = mysql_fetch_assoc($result)) {
    $url = (new Thepixeldeveloper\Sitemap\Url($page['permalink']))
      ->setLastMod($page['last_modified'])
      ->setChangeFreq('monthly')
      ->setPriority(0.5);

    $urlSet->addUrl($url);
}

// Output the sitemap as XML
header('Content-Type: application/xml');
echo (new Thepixeldeveloper\Sitemap\Output())->getOutput($urlSet);

You can use a rewrite rule in Apache (or the equivalent in another HTTP server) to rewrite sitemap.xml to sitemap.php.

RewriteEngine On
RewriteRule ^sitemap\.xml$ sitemap.php [L]
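
If you run nginx instead, a comparable rewrite might look like the sketch below (assuming PHP requests are already handled elsewhere in your server block):

location = /sitemap.xml {
    rewrite ^ /sitemap.php last;
}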

This should be sufficient, though there may be bots that do not respect robots.txt. Detect those and block them (by IP and/or User-Agent) in your HTTP server configuration.
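
For example, on Apache 2.4 you could block misbehaving crawlers by User-Agent with something like the sketch below; "BadBot" and "SpamCrawler" are placeholder names, so substitute the agents you actually see in your access log:

# Flag requests whose User-Agent matches known bad crawlers (placeholder names)
BrowserMatchNoCase "BadBot|SpamCrawler" bad_bot

<RequireAll>
    Require all granted
    Require not env bad_bot
    # You can also exclude specific IPs or ranges, e.g.:
    # Require not ip 203.0.113.0/24
</RequireAll>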


Also consider the following:

The max connections limits are there so your server doesn't get overloaded. You should run a benchmark to determine the maximum number of parallel requests your application can handle, then reduce that number by 20% and set it as the maximum in both your HTTP server and MySQL configuration.
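
As a rough sketch (the numbers are placeholders based on an assumed benchmark result of about 250 parallel requests), that could translate into something like:

# Benchmark with ApacheBench: 5000 requests, 250 at a time (placeholder URL)
ab -n 5000 -c 250 http://www.example.com/

# Apache, prefork MPM (MaxClients on Apache 2.2):
MaxRequestWorkers 200

# MySQL, e.g. in /etc/my.cnf -- allow a little headroom above the web server limit:
[mysqld]
max_connections = 220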

With those limits in place, your server will return a nice 503 Service Unavailable response before it gets overloaded. That makes well-behaved bots give up and try again later, so your system recovers without manual attention.

Your script should also exit with the correct HTTP response.

// Return a proper 503 when the database connection fails, so clients and
// well-behaved bots back off and retry later instead of hammering the site.
$connect = mysql_connect("localhost", "user", "password");
if (!$connect) {
  header("HTTP/1.1 503 Service Unavailable");
  echo 'Could not connect: ' . mysql_error();
  exit();
}
mysql_select_db("db", $connect);
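
Optionally, you could also send a Retry-After header along with the 503 so cooperative bots know when to come back (the 120 seconds below is an arbitrary example):

header("HTTP/1.1 503 Service Unavailable");
header("Retry-After: 120"); // suggest retrying in about two minutes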
Arnold Daniels
  • Thank you SO much. I am going to try and implement all this and see how it goes today. – Chris Filippou Mar 01 '17 at 13:13
  • I'm not gonna lie, I know the sitemap generator will be a huge help to reduce bots crawling the same pages over and over but I don't understand the package you sent. I don't see where I input the info to my database, etc. – Chris Filippou Mar 01 '17 at 13:34
  • @ChrisFilippou I'm assuming that your pages aren't static but come from a database. You query your database to create the sitemap; you don't write anything to the DB. I don't know what your database looks like, but I'll add something to the answer still. – Arnold Daniels Mar 02 '17 at 04:51
  • @Jansy, my database is structured terribly so it pulls from multiple tables for each category instead of a [category row] to sort the pages/posts. So basically I have a separate table for news, a table for videos, a table for blog posts, etc. How can I pull from them all this way? They all have a [date_time] row in common (instead of last modified). I know it's a pain and I appreciate this help SO much. – Chris Filippou Mar 02 '17 at 16:12
  • @ChrisFilippou It all should be pretty obvious. If you have multiple tables, you do multiple queries, each time adding the results to the urlset. – Arnold Daniels Mar 04 '17 at 08:25
  • @ChrisFilippou If this answer is helpful, please upvote and accept it. – Arnold Daniels Mar 04 '17 at 08:27
  • okay I think I get the query part for the tables. Do I need to install thepixeldeveloper/sitemap library using ssh? I see nothing on installation on that link. Something like wget https://github.com/ThePixelDeveloper/Sitemap.git ? – Chris Filippou Mar 06 '17 at 00:58