
I'm currently trying to delete duplicate rows in MySQL 5.7 (InnoDB) and am checking to see how many duplicates I have of a mediumtext column by running SELECT COLUMN, COUNT(*) FROM TABLE GROUP BY COLUMN HAVING COUNT(*) > 1. The most recent query returned:

[results]
31620 rows in set (17.98 sec)

If I run the exact same query a moment later, I get:

[results]
31594 rows in set (17.35 sec)

And so on. I get a different result almost every single time. Nothing is writing to the database during the query. It is doing this only with this query; SELECT COUNT(*) FROM TABLE, SELECT COUNT(*) FROM TABLE WHERE COLUMN LIKE <VALUE>, and so on, all produce consistent results. This error also does not occur when executing SELECT COLUMN, COUNT(*) FROM TABLE GROUP BY COLUMN HAVING COUNT(*) > 0.
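
To make the comparison concrete, here are the statements side by side (TABLE, COLUMN, and <VALUE> stand in for the real names):

    -- the query whose row count fluctuates between runs
    SELECT COLUMN, COUNT(*) FROM TABLE GROUP BY COLUMN HAVING COUNT(*) > 1;

    -- control queries that return the same counts every time
    SELECT COUNT(*) FROM TABLE;
    SELECT COUNT(*) FROM TABLE WHERE COLUMN LIKE '<VALUE>';
    SELECT COLUMN, COUNT(*) FROM TABLE GROUP BY COLUMN HAVING COUNT(*) > 0;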

I'm not sure what other code to offer to help answer this, as this is the only query I'm running and I'm doing it right in the console. I'm trying to think of what could possibly account for this. Given the other problems I've had with this same database, I'm wondering if it's possible that something is corrupted.

Edit: I've run 1000 queries to sample the results and they come out like this:

[histogram of the row counts returned across the 1,000 runs]

The upper limit of 33991 is the most common result.

The charset of the table is utf8mb4, and the collation of the column being aggregated is utf8mb4_general_ci.
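
For reference, the charset and collation can be confirmed with something like this (TABLE and COLUMN are placeholders for the real names):

    SHOW CREATE TABLE `TABLE`;

    SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
    FROM information_schema.COLUMNS
    WHERE TABLE_SCHEMA = DATABASE()
      AND TABLE_NAME = 'TABLE'
      AND COLUMN_NAME = 'COLUMN';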

The output of EXPLAIN SELECT COLUMN, COUNT(*) FROM TABLE GROUP BY COLUMN HAVING COUNT(*) > 1; while using MyISAM:

+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+---------------------------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows   | filtered | Extra                           |
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+---------------------------------+
|  1 | SIMPLE      | TABLE  | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 788685 |   100.00 | Using temporary; Using filesort |
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+---------------------------------+

Results for InnoDB:

+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+---------------------------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows   | filtered | Extra                           |
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+---------------------------------+
|  1 | SIMPLE      | TABLE  | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 769501 |   100.00 | Using temporary; Using filesort |
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+---------------------------------+

Things I have tried so far as suggested in comments:

  • Memtest, using the memtest Linux package and running memtest 15G 2 (the system has 16G of memory, with 15.4G available and about 0.4G in use). This is a cloud machine and I can't boot into Memtest, although I've put in a request with the provider to see if they can.
  • Enabling general log, which showed no other activity being run between queries.
  • Using OPTIMIZE TABLE.
  • Dropping and re-adding the index.
  • Changing the table engine from InnoDB to MyISAM, which did seem to help a little: the query now settles at a maximum value after a few runs, but it still bounces around for the first few. (Rough SQL for these steps is sketched below.)
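
In SQL terms, those steps look roughly like this (the index name and the 191-character prefix length are placeholders, since MySQL requires a prefix length when indexing a mediumtext column):

    -- enable the general query log to watch for other activity
    SET GLOBAL general_log = 'ON';

    -- rebuild the table
    OPTIMIZE TABLE `TABLE`;

    -- drop and re-create the index (idx_column is a placeholder name)
    ALTER TABLE `TABLE` DROP INDEX idx_column;
    ALTER TABLE `TABLE` ADD INDEX idx_column (`COLUMN`(191));

    -- switch the storage engine
    ALTER TABLE `TABLE` ENGINE = MyISAM;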
snapcrack
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/196433/discussion-on-question-by-snapcrack-selecting-duplicates-gives-different-result). – Bhargav Rao Jul 14 '19 at 00:36
  • Dump the table as SQL and import in a new DB instance and try again – Tarun Lalwani Jul 14 '19 at 06:09
  • Even if that worked, [I'd fear the backlash](https://stackoverflow.com/questions/2766785/fixing-lock-wait-timeout-exceeded-try-restarting-transaction-for-a-stuck-my/4797328#4797328) – snapcrack Jul 14 '19 at 07:27
  • For consistency, use limits and orders. I am not sure what the data set is, but you are probably running into the fact that SQL is not pulling these values in a particular order. If your query is running into memory issues (which for 30,000+ mediumtext results might be the case), your query might be maxing out. Try to run the same query but add `ORDER BY COLUMN DESC` and see if you get a more consistent result. You might also try running the query at a lower limit and doing a binary increment (e.g. LIMIT 15000, then 25000) until you start to get inconsistent results. – Aaron Morefield Jul 16 '19 at 03:34
  • @AaronMorefield The thing is, this problem isn't raised if I do `having count(*) > 0`, which returns a much larger set of rows. Wouldn't these same memory constraints raise the same issues with that query as well? – snapcrack Jul 16 '19 at 04:11
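
For reference, the variants suggested in that comment would look roughly like this (the LIMIT value is just the suggested starting point):

    -- force an explicit ordering of the grouped values
    SELECT COLUMN, COUNT(*)
    FROM TABLE
    GROUP BY COLUMN
    HAVING COUNT(*) > 1
    ORDER BY COLUMN DESC;

    -- cap the result size, then raise the limit until the counts start to diverge
    SELECT COLUMN, COUNT(*)
    FROM TABLE
    GROUP BY COLUMN
    HAVING COUNT(*) > 1
    LIMIT 15000;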

2 Answers


My limited knowledge of MySQL is triggering my spidey sense regarding TEXT-type columns. I think that for TEXT-type columns only the first 256 bytes are stored in the table row by default, with the rest of the text kept in internal MySQL storage. And since the "max_allowed_packet" property can differ between the MySQL client and the MySQL server, I think there is a possibility that the server sends your client a different subset of the full text every time, hence this ambiguity.

You should be able to increase the "max_allowed_packet" property for your mysql client and verify if you are indeed getting consistent results.
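
Something along these lines should let you check and raise it (the 256 MB value is only an example):

    -- check the current server-side limit
    SHOW VARIABLES LIKE 'max_allowed_packet';

    -- raise it for new connections (needs the appropriate privilege)
    SET GLOBAL max_allowed_packet = 268435456;  -- 256 MB

On the client side, the same limit can also be passed when starting the mysql client via its --max-allowed-packet option.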

bi_noob
  • Wouldn't this error also show up when the query selects for `count(*) > 0` as well? – snapcrack Jul 19 '19 at 02:58
  • When you GROUP BY on a mediumtext column, the MySQL engine reads (256 bytes + some variable length of data), so there is a chance that this variable-length portion is different every time; the error would work its way into any combination of count query as long as you are grouping on this mediumtext column. You can try to eliminate it by selecting only a fixed length, say the first 200 chars, and then see if the results are consistent. – bi_noob Jul 22 '19 at 07:21
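
That prefix check might look like this (200 characters is the example length from the comment above):

    -- group on a fixed-length prefix of the mediumtext column instead of the full value
    SELECT LEFT(COLUMN, 200) AS col_prefix, COUNT(*)
    FROM TABLE
    GROUP BY LEFT(COLUMN, 200)
    HAVING COUNT(*) > 1;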
POSSIBLE KEYS  |   KEY
NULL          |  NULL

This shows that the GROUP BY is not using any index (possible_keys and key are both NULL). Add an index on that column.
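
Since MySQL can only index a prefix of a mediumtext column, adding one would look something like this (the index name and the 191-character prefix length are placeholders):

    ALTER TABLE `TABLE` ADD INDEX idx_column_prefix (`COLUMN`(191));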

dodzb