My problem is:
I'm looking for duplicates in a data set about books but the code I'm currently using does not find all the duplicates I actually have. My data set has 5 columns: book id (int), title (string), writer, publication_year and row_id (int not null). For each row, the values of book id, title, writer and publication_year can be the same, but row_id is always different. The code I'm currently using looks for rows that are identical except for the row_id.
SELECT B1.ID_BOOK, B1.TITLE, B1.WRITER, B1.PUBLICATION_YEAR, B1.ROW_ID
FROM [BOOKS] AS B1, [BOOKS] AS B2
WHERE
B1.ROW_ID > B2.ROW_ID AND
B1.ID_BOOK = B2.ID_BOOK AND
B1.TITLE = B2.TITLE AND
B1.WRITER = B2.WRITER AND
B1.PUBLICATION_YEAR = B2.PUBLICATION_YEAR
The problem is that the column title can contain special characters as well as accents, and as far as I'm concern, it doesn't differenciate between "a" and "A" (upper and lower cases). I want to make sure it doesn't take the following titles as the same:
El día que se perdió la cordura
el día que se perdió la cordura
El dia que se perdio la cordura
el dia que se perdio la cordura
¡El día que se perdió la cordura!
What I have tried:
I've tried using COLLATE SQL_Latin1_General_CP1_CS_AS, but I'm not sure it will be useful if the titles are in other languages.
I thougt that maybe a good solution could be to compare the memory space of each string, so the language is not an issue, so I used DATALENGTH but is not being useful.
For further explanations, I do not want to unificate titles. I want to differentiate Camión from camión and from camion. For me, these are two different titles, and should not be counted as a duplicate.
Links I followed:
Finding duplicate values in a SQL table
How to get duplicate text values from SQL query
https://learnsql.com/blog/how-to-find-duplicate-values-in-sql/
https://chartio.com/learn/databases/how-to-find-duplicate-values-in-a-sql-table/
How to identify duplicates along with special characters in Oracle