Identify duplicate fields in a table

Question

I'm trying to identify specific fields that are duplicated in a table in a mariadb-10.4.20 Joomla database. I would like to identify all rows that have a specific field duplicated, then ultimately be able to remove those duplicates, leaving just the one with the highest ID.

This table contains the IDs, titles and aliases for the articles in a joomla website. The script I'm building (in perl) will use this information to print the primary title alias and create redirects for any others.

I was previously using "group by" but it appears there's been a change recently in how it's used, and now it doesn't work properly. I don't understand the new format, and I'm not even sure it was previously working fully.

Here's a basic query that shows there are two of the same articles with different IDs:

MariaDB [mydb]> select id,alias,title from db1_content where title = "article title";

+--------+---------------+--------------+
| id     | alias         | title        |
+--------+---------------+--------------+
| 299959 | unique-title | Unique Title  |
| 300026 | unique-title | Unique Title  |
+--------+------------------------------+

Here's an attempt at trying to use "group by" but it returns no results.

MariaDB [mydb]> select id,title,count(title) from db1_content group by id,title having count(title) > 1;
Empty set (0.230 sec)

If I run the same query without the id field, then it does return a list of all titles that are duplicated, along with the number of occurrences of each title.

That's not exactly what I want, though. I need it to print the id, alias and title fields so I can reference them in my perl script to subsequently perform another query to ultimately delete the duplicates and create links to be used in RewriteRules.

What am I doing wrong?

[MySQL / MariaDB - Remove all but last record per day from database](https://stackoverflow.com/q/61660726/2943403), [MariaDB - delete all values that are `not the max` of one column with a group by on another column](https://stackoverflow.com/q/70320787/2943403), [SQL delete if query returns more than 2 rows](https://stackoverflow.com/q/71756301/2943403), [SQL - delete all except the last N rows by serial_numb](https://stackoverflow.com/q/69085578/2943403) — mickmackusa, Apr 20 '22 at 08:40
[Delete all rows per group except one using CTE](https://stackoverflow.com/q/71475781/2943403), [delete ids from subquery results](https://stackoverflow.com/q/49781354/2943403), [sql, delete duplicate records using an ID](https://stackoverflow.com/q/63063163/2943403) — mickmackusa, Apr 20 '22 at 08:57
Does this answer your question? [SQL - delete all except the last N rows by serial\_numb](https://stackoverflow.com/questions/69085578/sql-delete-all-except-the-last-n-rows-by-serial-numb) — Nick, Apr 20 '22 at 09:14

Nick · Answer 1 · 2022-04-20T04:00:25.520

0

Since MariaDB cannot currently delete from a CTE, you could use a derived table to generate row numbers for each title ordered by id descending, JOIN that to your main table and then delete any row which has a row number greater than 1. For example:

DELETE db1 FROM db1_content db1
JOIN (
  SELECT id, 
         ROW_NUMBER() OVER (PARTITION BY title ORDER BY id DESC) AS rn
  FROM db1_content
  ) dbr ON db1.id = dbr.id
WHERE dbr.rn > 1

If you don't want to actually delete the records using SQL, you can just select the ones that need to be deleted by using a CTE:

WITH rns AS (
  SELECT *, 
         ROW_NUMBER() OVER (PARTITION BY title ORDER BY id DESC) AS rn
  FROM db1_content
)
SELECT id, alias, title
FROM rns
WHERE rn > 1

Demo on dbfiddle

edited Apr 20 '22 at 04:00

answered Apr 20 '22 at 03:33

Nick

138,499
22
57
95

This failed with "ERROR 1140 (42000): Mixing of GROUP columns (MIN(),MAX(),COUNT(),...) with no GROUP columns is illegal if there is no GROUP BY clause" on mariadb-10.5. Can this be done in terms that don't involve the new CTE method? What would someone have done prior to 10.2? This is all over my head. – Alex Regan Apr 20 '22 at 14:17
I should add that I don't actually want to delete the rows from the table - I need to print them out so I can produce RewriteRules for redirects from the lower IDs to the one with the highest ID. I can then ultimately choose to delete them by ID later – Alex Regan Apr 20 '22 at 14:23
I just located the following, which seems to indicate it's a bug or other config change that's apparently required for the above to work. https://jira.mariadb.org/browse/MDEV-24409 I still don't understand how this works, so a simplified version would be appreciated, but I do see that it now produces the desired results. – Alex Regan Apr 20 '22 at 16:05
@AlexRegan There are no GROUP columns in the queries I wrote? Prior to (not so much CTEs but window functions people would have used something like the answer to [this question](https://stackoverflow.com/questions/70320787/mariadb-delete-all-values-that-are-not-the-max-of-one-column-with-a-group-by) – Nick Apr 20 '22 at 23:24

Identify duplicate fields in a table

1 Answers1