Azure SQL Cannot find duplicates correctly in a free text field

Question

My problem is:

I'm looking for duplicates in a data set about books but the code I'm currently using does not find all the duplicates I actually have. My data set has 5 columns: book id (int), title (string), writer, publication_year and row_id (int not null). For each row, the values of book id, title, writer and publication_year can be the same, but row_id is always different. The code I'm currently using looks for rows that are identical except for the row_id.

SELECT B1.ID_BOOK, B1.TITLE, B1.WRITER, B1.PUBLICATION_YEAR, B1.ROW_ID
FROM [BOOKS] AS B1, [BOOKS] AS B2
WHERE
B1.ROW_ID > B2.ROW_ID AND
B1.ID_BOOK = B2.ID_BOOK AND 
B1.TITLE = B2.TITLE AND 
B1.WRITER = B2.WRITER AND 
B1.PUBLICATION_YEAR = B2.PUBLICATION_YEAR

The problem is that the column title can contain special characters as well as accents, and as far as I'm concern, it doesn't differenciate between "a" and "A" (upper and lower cases). I want to make sure it doesn't take the following titles as the same:

El día que se perdió la cordura
el día que se perdió la cordura
El dia que se perdio la cordura
el dia que se perdio la cordura
¡El día que se perdió la cordura!

What I have tried:

I've tried using COLLATE SQL_Latin1_General_CP1_CS_AS, but I'm not sure it will be useful if the titles are in other languages.
I thougt that maybe a good solution could be to compare the memory space of each string, so the language is not an issue, so I used DATALENGTH but is not being useful.
For further explanations, I do not want to unificate titles. I want to differentiate Camión from camión and from camion. For me, these are two different titles, and should not be counted as a duplicate.

Links I followed:

Finding duplicate values in a SQL table

How to get duplicate text values from SQL query

https://learnsql.com/blog/how-to-find-duplicate-values-in-sql/

https://chartio.com/learn/databases/how-to-find-duplicate-values-in-a-sql-table/

How to identify duplicates along with special characters in Oracle

Create a scalar function normalising the Title (to upper case, exclude meaningles symbols, replace accent symbols, etc). Use the function to compare Titles. For better perfomance store the normalized title along with original one as a computed column. — Serg, Jan 19 '22 at 10:29
FYI, it's *2022*, the ANSI-92 explicit JOIN syntax has been around for **30 years**; it is long last time you adopted it. There is no reason to be using that 1980's implicit JOIN syntax anymore. — Thom A, Jan 19 '22 at 10:30
_but the code I'm currently using does not_ And what is that code? _it doesn't differenciate_ What is "it"? And since you're asking, post DDL for the table so no one needs to guess at how it is actually defined. — SMor, Jan 19 '22 at 10:42
Hello can you try hashing the title and then joining it? select hashbytes('MD5', b1.title) hash_b1, b1.title from books b1 join ( select hashbytes('MD5', b.title) hash_b2, b.title from books b) b2 on b2.hash_b2 = hash_b1 — Dominik Klug, Jan 19 '22 at 10:42

score 0 · Answer 1 · answered Jan 20 '22 at 11:38

You can compare the title column after converting to Varbinary.

I had tried it in my lab as below.

Inserted the first 5 rows with the titles you provide with different row_id and inserted another row with row_id 6 which has the title exactly the same as row_id 1 to identify duplicate values.

Added binary conversion while comparing titles in your code.

SELECT B1.ID, B1.TITLE, B1.WRITER, B1.PUBLICATION_YEAR, B1.ROW_ID FROM [BOOKS] AS B1, [BOOKS] AS B2 WHERE B1.ROW_ID > B2.ROW_ID AND B1.ID = B2.ID AND convert(varbinary,B1.TITLE) = convert(varbinary,B2.TITLE) AND B1.WRITER = B2.WRITER AND B1.PUBLICATION_YEAR = B2.PUBLICATION_YEAR

This results in only 1 row (row_id = 6) which is duplicated in the table.

Azure SQL Cannot find duplicates correctly in a free text field

1 Answers1