
My Laravel 5.7 website has been experiencing a few problems that I think are related to each other (but happen at different times):

  1. PDO::prepare(): MySQL server has gone away
  2. E_WARNING: Error while sending STMT_PREPARE packet. PID=10
  3. PDOException: SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry (My database often seems to try to write the same record twice in the same second. I've been unable to figure out why or how to reproduce it; it doesn't seem to be related to user behavior.)
  4. Somehow, the first 2 types of errors only ever appear in my Rollbar logs, never in the server's text logs or in my Slack notifications, even though all errors are supposed to show up in those places (and all other errors do).

For months, I've continued to see scary log messages like these, but I've been completely unable to reproduce the errors, let alone diagnose and solve them.

I haven't yet found any actual symptoms or heard of any complaints from users, but the error messages seem non-trivial, so I really want to understand and fix the root causes.


I've tried changing my MySQL config to use max_allowed_packet=300M (instead of the default of 4M) but still get these exceptions frequently on days when I have more than a couple of visitors to my site.

I've also set the following (changed from 5M and 10M) because of this advice:

innodb_buffer_pool_chunk_size=218M
innodb_buffer_pool_size = 218M

As further background:

  • My site has a queue worker that runs jobs (artisan queue:work --sleep=3 --tries=3 --daemon).
  • There are a bunch of queued jobs that can be scheduled to happen at the same moment, based on the signup time of visitors. But the most I've seen run simultaneously is 20.
  • There are no entries in the MySQL Slow Query Log.
  • I have a few cron jobs, but I doubt they're problematic. One runs every minute but is really simple. Another runs every 5 minutes to send certain scheduled emails if any are pending. And another runs every 30 minutes to run a report.
  • I've run various mysqlslap queries (I'm a complete novice with it, though) and haven't found anything slow, even when simulating hundreds of concurrent clients.
  • I'm using Laradock (Docker).
  • My server is DigitalOcean 1GB RAM, 1 vCPU, 25GB SSD. I've also tried 2GB RAM with no difference.
  • The results from SHOW VARIABLES; and SHOW GLOBAL STATUS; are here.

My my.cnf is:

[mysql]

[mysqld]
sql-mode="STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION"
character-set-server=utf8
innodb_buffer_pool_chunk_size=218M
innodb_buffer_pool_size = 218M
max_allowed_packet=300M
slow_query_log = 1
slow_query_log_file = /var/log/mysql/slow_query_log.log
long_query_time = 10
log_queries_not_using_indexes = 0

Any ideas about what I should explore to diagnose and fix these problems? Thanks.


Ryan
  • Can you show us the constructed SQL statement, and `SHOW CREATE TABLE`? – Rick James Dec 03 '18 at 23:33
  • @RickJames I updated the question. How can I show the "constructed" MySQL statement? (And since it only occasionally fails, it would be helpful to know how to debug / log what is different about the failure cases.) Thanks. – Ryan Dec 04 '18 at 01:16
  • If you would like to log the SQL of the query, you can use `\DB::enableQueryLog();` before the query and `logger(\DB::getQueryLog());` afterwards. This will save the query to your Laravel log file. – Josh Dec 06 '18 at 04:07
  • @josh Thanks. In my question above, I added a sample of what the `\DB::getQueryLog()` says, although it's just an example (and didn't coincide with this error actually happening on my production machine, since my production machine doesn't have this logging enabled, and the error doesn't always happen). – Ryan Dec 06 '18 at 20:09
  • @RickJames See my new comment and update. Thanks. – Ryan Dec 06 '18 at 20:09
  • @Ryan - next time it happens see if you can guess what the job_class_name was. Might it have been a really long string? With various escaped characters? (Etc.) – Rick James Dec 06 '18 at 20:55
  • @Ryan - And... The optimal index for that query is a 4-column composite index: `INDEX(contact_id, job_class_name, execute_at, deleted_at)`. (Any rearrangement of those columns is just as good.) – Rick James Dec 06 '18 at 20:57
  • @RickJames Wow, today in my query log, I saw `'time' => 157.95`!! That's shocking. My table has <2000 records and has 3 indexes. I avoided indexing `deleted_at` because of https://stackoverflow.com/a/26530488/470749. But you're saying the best approach would be the 4-column composite index? I'm so confused about 2 things. A) How my query would ever take this long on so few records even if I had *no* indexes, and B) why my 3 separate indexes wouldn't be just as good as the proposed 4-column composite index, given what people have written about the cardinality of `deleted_at`. – Ryan Dec 12 '18 at 16:48
  • @Ryan - I am not suggesting a single-column `INDEX(deleted_at)`, but rather a 4-column 'composite' index. Meanwhile, I bled all over that other Q&A. – Rick James Dec 13 '18 at 04:33
  • @Ryan - A) No rational reason for 158 seconds. B) Think about two alphabetized lists -- one with a million last names and phone numbers, the other with first names and phone numbers. How would you find my phone number? Compare that to a list that was alphabetized on last, first. – Rick James Dec 13 '18 at 04:37
  • @Ryan - See if you can get `EXPLAIN SELECT ...` (with the values filled in). I don't expect it to help, but I'm getting desperate. – Rick James Dec 16 '18 at 00:27
  • @RickJames Good news is that 1 of my 4 problems wasn't a problem at all. My queries that I thought took 352 seconds really only took 0.352 seconds! I've updated my question to remove that part. But I'm still seeing logged errors about `PDO::prepare(): MySQL server has gone away` really frequently. I'm going to see if I can temporarily disable cron jobs and queued jobs and hopefully eventually isolate a correlation. Maybe the problem has nothing to do with RAM or MySQL and is more like what this person experienced with Facebook query slowness: https://stackoverflow.com/a/1885584/470749 – Ryan Jan 26 '19 at 22:29
  • What is your Laradock config? Can you share your docker-compose.yml? – Alex Jan 31 '19 at 18:37
  • @Alex It's this (https://raw.githubusercontent.com/laradock/laradock/85c57a0b6fa3af2358e524010b95970ff9677a1b/docker-compose.yml) except I changed version to 2.1, added this SSL in nginx (`- /apps/laradock_data/certs:/etc/nginx/ssl`), and these in mysql: `mem_limit: 300m` and `memswap_limit: 2g`. And added a few `restart: always` in various sections. – Ryan Jan 31 '19 at 21:24
  • You should remove all the applications you don't use in your project: `percona, mssql, blackfire, postgressql, neo4j, mongoDB, aerospike... etc... etc..` You are running too much on the same instance :-) . Could you just share the `docker ps` and `docker stats` output please? ;-) – Alex Jan 31 '19 at 21:27
  • @Alex They are on separate tabs here now, thanks: https://docs.google.com/spreadsheets/d/1xMrpgdUFGkOgw_RP9ubIuE2seBsXTzKvl86z318T2B8/edit#gid=117055734 – Ryan Jan 31 '19 at 21:37
  • Which PHP is losing the connection: `php-fpm` or `php-worker`? – Alex Jan 31 '19 at 21:53
  • That is pretty difficult to debug remotely, but I would start by moving MySQL out of Docker and off the same instance. I am not sure about your load, but one shared vCPU could be the problem. – Alex Jan 31 '19 at 21:57
  • @Alex if `php-worker` is what processes queued jobs and cron jobs, then probably that one. – Ryan Jan 31 '19 at 22:19
  • Could you please check the average output of `show processlist;` and `show status where \`variable_name\` = 'Threads_connected';` at peak time? – Alex Feb 01 '19 at 14:39
  • @Alex `SHOW FULL PROCESSLIST` at peak traffic was this: https://docs.google.com/spreadsheets/d/1xMrpgdUFGkOgw_RP9ubIuE2seBsXTzKvl86z318T2B8/edit#gid=1712520010 I won't know the answer to the other one til the next peak traffic time (24 hrs from now), but currently it says "10". Thanks. – Ryan Feb 02 '19 at 01:06
  • @Ryan You could try to work around the issue with a library like this one: https://github.com/adamturcsan/reconnecting-pdo It wraps/extends PDO and transparently reconnects & rebuilds prepared statements if the connection was lost. I also wrote a longer answer (see lower on this page) with some ideas about the root cause. – Mathieu Rey Apr 21 '19 at 06:39
  • Is your code called twice? I'm thinking of preflight requests, which have become a thing. – Herbert Van-Vliet Apr 14 '21 at 12:43

4 Answers


Re Slowlog: Show us your my.cnf. Were the changes in the [mysqld] section? Test it via SELECT SLEEP(12);, then look both in the slow log file and in the mysql.slow_log table.

Alternate way to find the query: Since the query is taking several minutes, do SHOW FULL PROCESSLIST; when you think it might be running.

How much RAM do you have? Do not have max_allowed_packet=300M unless you have at least 30GB of RAM. Else you are risking swapping (or even crashing). Keep that setting under 1% of RAM.

For further analysis of tunables, please provide (1) RAM size, (2) SHOW VARIABLES; and (3) SHOW GLOBAL STATUS;.

Re deleted_at: That link you gave starts with "The column deleted_at is not a good index candidate". You misinterpreted it. It is talking about a single-column INDEX(deleted_at). I am suggesting a composite index such as INDEX(contact_id, job_class_name, execute_at, deleted_at).
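
For concreteness, that composite index could be added with a Laravel migration along these lines. This is only a sketch: the table name `automations` and the index name are guesses, since the question never names the table behind the failing query.

<?php

use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

class AddCompositeIndexToAutomations extends Migration
{
    public function up()
    {
        Schema::table('automations', function (Blueprint $table) {
            // One 4-column composite index covering every column the WHERE clause filters on.
            $table->index(
                ['contact_id', 'job_class_name', 'execute_at', 'deleted_at'],
                'automations_lookup_index'
            );
        });
    }

    public function down()
    {
        Schema::table('automations', function (Blueprint $table) {
            $table->dropIndex('automations_lookup_index');
        });
    }
}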

158 seconds for a simple query on a small table? It could be that there is a lot of other stuff going on. Get the PROCESSLIST.

Re Separate indexes versus composite: Think of two indexes: INDEX(last_name) and INDEX(first_name). You flip through the last_name index to find "James", then what can you do? Flipping through the other index for "Rick" won't help you find me.

Analysis of VARIABLES and GLOBAL STATUS

Observations:

  • Version: 5.7.22-log
  • 1.00 GB of RAM
  • Uptime = 16d 10:30:19
  • Are you sure this was a SHOW GLOBAL STATUS?
  • You are not running on Windows.
  • Running 64-bit version
  • You appear to be running entirely (or mostly) InnoDB.

The More Important Issues:

innodb_buffer_pool_size -- I thought you had it at 218M, not 10M. 10M is much too small. On the other hand, you seem to have less than that much data.

Since the RAM is so small, I recommend dropping tmp_table_size, max_heap_table_size, and max_allowed_packet to 8M, and lowering table_open_cache, table_definition_cache, and innodb_open_files to 500.

What causes so many simultaneous connections?

Details and other observations:

( innodb_buffer_pool_size / _ram ) = 10M / 1024M = 0.98% -- % of RAM used for InnoDB buffer_pool

( innodb_buffer_pool_size ) = 10M -- InnoDB Data + Index cache

( innodb_lru_scan_depth ) = 1,024 -- "InnoDB: page_cleaner: 1000ms intended loop took ..." may be fixed by lowering lru_scan_depth

( Innodb_buffer_pool_pages_free / Innodb_buffer_pool_pages_total ) = 375 / 638 = 58.8% -- Pct of buffer_pool currently not in use -- innodb_buffer_pool_size is bigger than necessary?

( Innodb_buffer_pool_bytes_data / innodb_buffer_pool_size ) = 4M / 10M = 40.0% -- Percent of buffer pool taken up by data -- A small percent may indicate that the buffer_pool is unnecessarily big.

( innodb_log_buffer_size / _ram ) = 16M / 1024M = 1.6% -- Percent of RAM used for buffering InnoDB log writes. -- Too large takes away from other uses for RAM.

( innodb_log_file_size * innodb_log_files_in_group / innodb_buffer_pool_size ) = 48M * 2 / 10M = 960.0% -- Ratio of log size to buffer_pool size. 50% is recommended, but see other calculations for whether it matters. -- The log does not need to be bigger than the buffer pool.

( innodb_flush_method ) = innodb_flush_method = -- How InnoDB should ask the OS to write blocks. Suggest O_DIRECT or O_ALL_DIRECT (Percona) to avoid double buffering. (At least for Unix.) See chrischandler for caveat about O_ALL_DIRECT

( innodb_flush_neighbors ) = 1 -- A minor optimization when writing blocks to disk. -- Use 0 for SSD drives; 1 for HDD.

( innodb_io_capacity ) = 200 -- I/O ops per second capable on disk. 100 for slow drives; 200 for spinning drives; 1000-2000 for SSDs; multiply by RAID factor.

( innodb_print_all_deadlocks ) = innodb_print_all_deadlocks = OFF -- Whether to log all Deadlocks. -- If you are plagued with Deadlocks, turn this on. Caution: If you have lots of deadlocks, this may write a lot to disk.

( min( tmp_table_size, max_heap_table_size ) / _ram ) = min( 16M, 16M ) / 1024M = 1.6% -- Percent of RAM to allocate when needing MEMORY table (per table), or temp table inside a SELECT (per temp table per some SELECTs). Too high may lead to swapping. -- Decrease tmp_table_size and max_heap_table_size to, say, 1% of ram.

( net_buffer_length / max_allowed_packet ) = 16,384 / 16M = 0.10%

( local_infile ) = local_infile = ON -- local_infile = ON is a potential security issue

( Select_scan / Com_select ) = 111,324 / 264144 = 42.1% -- % of selects doing full table scan. (May be fooled by Stored Routines.) -- Add indexes / optimize queries

( long_query_time ) = 10 -- Cutoff (Seconds) for defining a "slow" query. -- Suggest 2

( Max_used_connections / max_connections ) = 152 / 151 = 100.7% -- Peak % of connections -- increase max_connections and/or decrease wait_timeout

You have the Query Cache half-off. You should set both query_cache_type = OFF and query_cache_size = 0 . There is (according to a rumor) a 'bug' in the QC code that leaves some code on unless you turn off both of those settings.

Abnormally small:

( Innodb_pages_read + Innodb_pages_written ) / Uptime = 0.186
Created_tmp_files = 0.015 /HR
Handler_write = 0.21 /sec
Innodb_buffer_pool_bytes_data = 3 /sec
Innodb_buffer_pool_pages_data = 256
Innodb_buffer_pool_pages_total = 638
Key_reads+Key_writes + Innodb_pages_read+Innodb_pages_written+Innodb_dblwr_writes+Innodb_buffer_pool_pages_flushed = 0.25 /sec
Table_locks_immediate = 2.8 /HR
Table_open_cache_hits = 0.44 /sec
innodb_buffer_pool_chunk_size = 5MB

Abnormally large:

Com_create_db = 0.41 /HR
Com_drop_db = 0.41 /HR
Connection_errors_peer_address = 2
Performance_schema_file_instances_lost = 9
Ssl_default_timeout = 500

Abnormal strings:

ft_boolean_syntax = + -><()~*:&
have_ssl = YES
have_symlink = DISABLED
innodb_fast_shutdown = 1
optimizer_trace = enabled=off,one_line=off
optimizer_trace_features = greedy_search=on, range_optimizer=on, dynamic_range=on, repeated_subselect=on
session_track_system_variables = time_zone, autocommit, character_set_client, character_set_results, character_set_connection
slave_rows_search_algorithms = TABLE_SCAN,INDEX_SCAN
Rick James
  • +1 for these hints. My server is DigitalOcean 1GB RAM, 1 vCPU, 25GB SSD. I've also tried 2GB RAM with no difference. It sounds like you strongly disagree with https://dba.stackexchange.com/questions/27328/how-large-should-be-mysql-innodb-buffer-pool-size/27341#27341. I've updated the question to include my `my.cnf` and `SHOW VARIABLES;` and `SHOW GLOBAL STATUS;`. Thanks for the composite index info. It seems like mine is good then. – Ryan Jan 26 '19 at 02:25
  • @Ryan - Well, with only 1GB of RAM, you really need to lower `max_allowed_packet=300M`. – Rick James Jan 26 '19 at 03:07
  • @Ryan - I added my analysis of the tunables. – Rick James Jan 26 '19 at 03:29
  • Thank you so much. Although I don't understand most of this (and will try to study more and eventually implement more of it later), I've now updated my `my.cnf` to change ~20 settings, rebuilt my Docker container, and verified that the new settings appear when I run `SHOW VARIABLES;`. I'll just have to wait for good traffic later today and see if I notice anything different. Thanks. – Ryan Jan 26 '19 at 16:02
  • @Ryan - Which 20 settings? Any number that is _increased_ is likely to put you at risk of running out of the tiny 1GB of RAM. – Rick James Jan 26 '19 at 16:45
  • These are the settings I'm trying now: https://docs.google.com/spreadsheets/d/1xMrpgdUFGkOgw_RP9ubIuE2seBsXTzKvl86z318T2B8/edit#gid=1899926392 – Ryan Jan 26 '19 at 17:32
  • `PDO::prepare(): MySQL server has gone away` has been happening nearly every minute now! I also just got a Slack notification that `\DB::getQueryLog()` logged a recent query of my automations table as having `'time' => 10.68`. `SELECT @@global.long_query_time;` shows 2.0, but `select * from mysql.slow_log;` does not show that long query. `SHOW FULL PROCESSLIST;` results in https://docs.google.com/spreadsheets/d/1xMrpgdUFGkOgw_RP9ubIuE2seBsXTzKvl86z318T2B8/edit#gid=1712520010 – Ryan Jan 26 '19 at 18:30
  • I changed back to `max_allowed_packet=300M` thinking it would help because https://stackoverflow.com/a/4587554/470749 and https://dev.mysql.com/doc/refman/5.7/en/gone-away.html, but I continue to see "PDO::prepare(): MySQL server has gone away" happening. I've stopped running ads because even though **I** haven't seen any symptoms, I'm worried that visitors are. My landing page conversion rate was far lower than usual. – Ryan Jan 26 '19 at 19:03
  • Swapping makes mysql slooooooooooow. – Rick James Jan 26 '19 at 19:56

I encountered the same situation with a long-running PHP CLI script (it listens on a Redis list; each action is quick, but the script basically runs forever).

I create the PDO object and a prepared statement at the beginning, then reuse them afterwards.

The day after I started the script, I got the exact same errors:

PHP Warning:  Error while sending STMT_EXECUTE packet. PID=9438 in /...redacted.../myscript.php on line 39

SQLSTATE[HY000]: General error: 2006 MySQL server has gone away

In my case, it's a development server, there is no load, and MySQL is on the same box... so it's unlikely to come from external factors. It's most likely related to the fact that I used the same MySQL connection for too long and it timed out. And PDO doesn't notice or reconnect, so any subsequent query just returns "MySQL server has gone away".

Checking the value of "wait_timeout" in MySQL:

mysql> show variables like 'wait_timeout';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wait_timeout  | 28800 |
+---------------+-------+
1 row in set (0.06 sec)

mysql> show local variables like 'wait_timeout';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wait_timeout  | 28800 |
+---------------+-------+
1 row in set (0.00 sec)

I see 28800 seconds = 8 hours, which is consistent with the timing of my errors.

In my case, restarting the MySQL server, or setting wait_timeout very low, while keeping the same PHP worker running, makes it very easy to reproduce the issue.
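
For illustration, a minimal plain-PDO script along those lines might look like this (the DSN and credentials are placeholders; adjust to your environment):

<?php

$pdo = new PDO('mysql:host=127.0.0.1;dbname=test', 'user', 'secret', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

// Shrink this connection's idle timeout so we don't have to wait 8 hours.
$pdo->exec('SET SESSION wait_timeout = 5');

$stmt = $pdo->prepare('SELECT 1');
$stmt->execute();                   // works: the connection is still alive

sleep(10);                          // stay idle past wait_timeout; the server drops the connection

try {
    $stmt->execute();               // triggers the STMT_EXECUTE warning...
} catch (PDOException $e) {
    echo $e->getMessage(), PHP_EOL; // ...and "SQLSTATE[HY000]: General error: 2006 MySQL server has gone away"
}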

Overall:

  • PDO doesn't care if the connection times out, and will not automatically reconnect. If you put a try/catch around your PDO queries, the script will never crash and will keep using the stale PDO instance.
  • the STMT_EXECUTE warning is probably incidental; it just means that the script whose connection timed out was using prepared statements, and that the first query after the timeout happened to use one.

To get back to your case

  • in theory Laravel 5 is immune to this issue: https://blog.armen.im/en/laravel-4-and-stmt_prepare-error/; do you use something other than Illuminate, or even bare PDO directly? Also, I'm not sure what Laravel does when it detects a lost connection (does it reconnect and rebuild prepared statements?), so it might be worth digging further.
  • check your MySQL wait_timeout value, and increase it if it's too low
  • if it's not happening all the time, see if the errors correlate with server / DB load. High load can make things (especially big SQL queries) several times slower, to the point that some other MySQL timeout like max_execution_time gets reached.
  • check whether you wrapped PDO queries in a try / catch block that retries the query; it might be preventing the connection error from bubbling up (a rough reconnect-and-retry sketch follows after this list).
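
The last bullet could be turned into a small reconnect-and-retry helper for a long-running Laravel worker. This is only a sketch built on Laravel's `DB::reconnect()`; the helper name and the `automations` table are made up, and retrying exactly once is an arbitrary choice:

<?php

use Illuminate\Database\QueryException;
use Illuminate\Support\Facades\DB;

// Run a query closure; if the connection was dropped, reconnect and retry once.
function runWithReconnect(callable $query)
{
    try {
        return $query();
    } catch (QueryException $e) {
        // MySQL error 2006: "MySQL server has gone away"
        if (strpos($e->getMessage(), 'server has gone away') !== false) {
            DB::reconnect();
            return $query();
        }
        throw $e;
    }
}

// Example usage inside a queue job or daemon loop:
$rows = runWithReconnect(function () {
    return DB::table('automations')->where('contact_id', 123)->get();
});
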
Mathieu Rey
  • In the case of my long running workers, I worked around the issue thanks to this library: https://github.com/adamturcsan/reconnecting-pdo It wraps/extends PDO and transparently reconnects & rebuilds prepared statements if the connection was lost. – Mathieu Rey Apr 21 '19 at 06:36
  • +1 for the effort, but it's no longer a priority for me to try to solve this since inexplicably the error messages for #1 and #2 have stopped. I never like when bugs happen without me eventually understanding the root cause, but I'm focusing elsewhere now. I'll check out this answer more carefully if I see these errors again. Thanks. – Ryan Apr 22 '19 at 12:48

If you see this message randomly, possible reasons:

  1. Your MySQL server is behind a proxy, and the proxy uses a different timeout configuration.

  2. You are using PHP's persistent connections.

You can try to dig into the problem with these steps:

  1. Make sure your connections to MySQL have a long enough timeout (e.g. proxy settings, MySQL's wait_timeout / interactive_timeout).

  2. Disable persistent connections on the PHP side (a sketch of how to do that follows below).

  3. Run some tcpdump captures if you can, to see what happened when you got the error message.
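
For step 2, here is a sketch of explicitly opting out of persistent connections with plain PDO. False is already the PDO default, so this only matters if something else turned it on; the DSN and credentials are placeholders:

<?php

$pdo = new PDO('mysql:host=127.0.0.1;dbname=app', 'user', 'secret', [
    // Opt out of persistent connections explicitly.
    PDO::ATTR_PERSISTENT => false,
]);

// In Laravel, the same PDO attribute can be passed through the 'options' array
// of the mysql connection in config/database.php.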

shawn
  • https://stackoverflow.com/a/15802020/470749 agrees that persistent connections aren't ideal. I checked my `/etc/php/7.2/cli/php.ini`, and it says `mysqli.allow_persistent = On` even though https://stackoverflow.com/a/51849038/470749. But Laravel doesn't use persistent connections anyway: https://fideloper.com/laravel-pdo-connection-options. I'm not sure what "Your MySQL is behind a proxy" means. My production server does use Docker containers. I'm using https://laradock.io/. Thanks. – Ryan Dec 10 '18 at 15:28
  • I just added a 500 bounty. Any new ideas? Thanks. – Ryan Jan 26 '19 at 01:22
  • @Ryan If there is no slow query in the log, but your client (PHP) waits too long, there is usually a network problem. MySQL only records slow queries in the log; if your network is laggy, MySQL won't treat that lag as slow. I think `tcpdump` would help: it will show not only the network problem, but also what happened with the MySQL protocol. – shawn Jan 26 '19 at 04:20
  • @Ryan some tips about how to `tcpdump` on a busy site: you can write a script that runs `tcpdump` every minute and saves to individual files with a timestamp in the filename; that reduces the captured data size. And when you see something wrong, just look into the capture from that time. – shawn Jan 26 '19 at 04:24
  • I almost got this to work: `cd /apps/tcpdumps && sudo bash -c "nohup tcpdump -i eth0 -G 300 -W 3 -w trace-%dd-%Hh-%Mm.pcap &"` but it seems to stay open in my terminal (doesn't give me the cursor back on the next line), so I must be doing it wrong. Also, I'm not sure what I'd be looking for in the pcap files anyway. – Ryan Jan 26 '19 at 21:07
  • You can try to `sudo bash` first to get a root shell and run your scripts, and you may need a `while` loop in your script. – shawn Jan 27 '19 at 03:51

I had an interesting observation on my CentOS machine, running PHP 7.2.x and Apache 2.4.6. After changing the host in my CodeIgniter config from '127.0.0.1' to 'localhost', the problem disappeared.

So change:

'hostname' => '127.0.0.1' to 'hostname' => 'localhost'

I reverted the configuration a couple of times to double check, but somehow this does the trick all the time...
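
To make the difference concrete, here is roughly what the two settings translate to at the PDO level. The socket path below is the common Debian/Ubuntu default and the credentials are placeholders:

<?php

// 'hostname' => 'localhost'  -> the client talks to MySQL over a Unix socket:
$viaSocket = new PDO('mysql:unix_socket=/var/run/mysqld/mysqld.sock;dbname=app', 'user', 'secret');

// 'hostname' => '127.0.0.1'  -> forces a TCP/IP connection:
$viaTcp = new PDO('mysql:host=127.0.0.1;port=3306;dbname=app', 'user', 'secret');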

BartjeV
  • 127.0.0.1 != localhost. If you use localhost, MySQL will use a Unix socket for the connection. With 127.0.0.1 it will be a TCP/IP connection (see https://dev.mysql.com/doc/refman/5.7/en/can-not-connect-to-server.html). According to the documentation, the performance difference should be minimal. I'm not an expert on this topic, but if this solves your issue it might indicate a problem with the TCP/IP stack. – derflocki Jul 11 '21 at 05:56