Find possible duplicates in two columns ignoring case and special characters

Question

Query

SELECT COUNT(*), name, number
FROM   tbl
GROUP  BY name, number
HAVING COUNT(*) > 1

It sometimes fails to find duplicates between lower case and upper case.
E.g.: sunny and Sunny don't show up as a duplicates.
So how to find all possible duplicates in PostgreSQL for two columns.

When you say "special characters" do you mean that you want "Soft" and "$s^o&f!t" to be equal? Or are you talking about accented characters, where you want "Cafe" and "Café" to be matched as equal? — Craig Ringer, Oct 20 '12 at 01:33

Erwin Brandstetter · Accepted Answer · 2021-03-12T00:51:39.037

18

`lower()`/ `upper()`

Use one of these to fold characters to either lower or upper case. Special characters are not affected:

SELECT count(*), lower(name), number
FROM   tbl
GROUP  BY lower(name), number
HAVING count(*) > 1;

`unaccent()`

If you actually want to ignore diacritic signs, like your comments imply, install the additional module unaccent, which provides a text search dictionary that removes accents and also the general purpose function unaccent():

CREATE EXTENSION unaccent;

Makes it very simple:

SELECT lower(unaccent('Büßercafé'));

Result:

busercafe

This doesn't strip non-letters. Add regexp_replace() like @Craig mentioned for that:

SELECT lower(unaccent(regexp_replace('$s^o&f!t Büßercafé', '\W', '', 'g') ));

Result:

softbusercafe

You can even build a functional index on top of that:

Does PostgreSQL support "accent insensitive" collations?

edited Mar 12 '21 at 00:51

answered Oct 19 '12 at 17:36

Erwin Brandstetter

605,456
145
1,078
1,228

is there something similar to RLIKE in postgre – Ghostman Oct 19 '12 at 17:38
@soul: Case insensitive `LIKE`? [`ILIKE`](http://www.postgresql.org/docs/current/interactive/functions-matching.html#FUNCTIONS-LIKE). – Erwin Brandstetter Oct 19 '12 at 17:40
In mysql there is function RLIKE ... is ILIKE similar to it in postgre http://dev.mysql.com/doc/refman/5.0/en/regexp.html – Ghostman Oct 19 '12 at 17:41
You can use the [regular expression match operator **`~`**](http://www.postgresql.org/docs/current/interactive/functions-matching.html#FUNCTIONS-POSIX-REGEXP) for a similar purpose as MySQL's [RLIKE](http://dev.mysql.com/doc/refman/5.0/en/regexp.html#operator_regexp). – Erwin Brandstetter Oct 19 '12 at 17:43
I have seen from the comments, you want to "unaccent", too. Well, there is a much simpler and proven way. I added a bit to my answer. – Erwin Brandstetter Oct 26 '12 at 08:34
Huh, I didn't know about `unaccent`. Useful especially since it works with tsearch2. I still think the whole idea of "unaccenting" is pretty terrible, but if you're stuck with terrible data I guess you have to do terrible things. – Craig Ringer Oct 26 '12 at 09:39
@CraigRinger: Unaccenting can be *extremely useful* to keep the rate of alpha errors low (undesirable rows found) while also reducing beta errors (desired rows not found) in a fuzzy search with human input, especially in languages with lots of (more or less optional) accents, i.e. most languages that are not English. – Erwin Brandstetter Oct 26 '12 at 20:51
@ErwinBrandstetter there is a problem with the query it is not able to find the duplicates in the name!!! but wen we do `group by name having count(name)>1` i can get the duplicates – Ghostman Oct 30 '12 at 05:55
@soul: I suppose you post a new question with your query and some sample data. The problem should be easy to find with this information. You can always link to this question for context. – Erwin Brandstetter Oct 30 '12 at 05:59
@ErwinBrandstetter can u suggest a proper question for this!! like i will update the question !!! need to update ur query with the records is it – Ghostman Oct 30 '12 at 06:05
@soul: Post a *new* question showing the exact query and and example values that it fails with. I don't have enough information to work with (I don't quite understand what your problem is). – Erwin Brandstetter Oct 30 '12 at 06:11
@ErwinBrandstetter posted it http://stackoverflow.com/questions/13133599/query-not-able-to-find-the-duplicates-between-the-name – Ghostman Oct 30 '12 at 06:12

score 4 · Answer 2 · answered Oct 19 '12 at 17:36

PostgreSQL by default is case sensitive. You can force it to be case-insensitive during searches by converting all values to a single case:

SELECT COUNT(*), lower(name), number FROM TABLE 
GROUP BY lower(name), number HAVING COUNT(*) > 1

NOTE: This has not been tested in Postgres

Craig Ringer · Answer 3 · 2012-10-23T09:33:34.343

(Updated answer after clarification from poster): The idea of "unaccenting" or stripping accents (dicratics) is generally bogus. It's OK-ish if you're matching data to find out if some misguided user or application munged résumé into resume, but it's totally wrong to change one into the other, as they're different words. Even then it'll only kind-of work, and should be combined with a string-similarity matching system like trigrams or Levenshtein distances.

The idea of "unaccenting" presumes that any accented character has a single valid equivalent unaccented character, or at least that any given accented character is replaced with at most one unaccented character in an ascii-ized representation of the word. That simply isn't true; in one language ö might be a "u" sound, while in another it might be a long "oo", and the "ascii-ized" spelling conventions might reflect that. Thus, in language the correct "un-accenting" of the made-up dummy-word "Tapö" might be "Tapu" and in another this imaginary word might be ascii-ized to "Tapoo". In neither case will the "un-accented" form of "Tapo" match what people actually write when forced into the ascii character set. Words with dicratics may also be ascii-ized into a hyphenated word.

You can see this in English with ligatures, where the word dæmon is ascii-ized daemon. If you stripped the ligature you'd get dmon which wouldn't match daemon, the common spelling. The same is true of æther which is typically ascii-ized to aether or ether. You can also see this in German with ß, typically "expanded" as ss.

If you must attempt to "un-accent", "normalize" accents or "strip" accents:

You can use a character class regular expression to strip out all but a specified set of characters. In this case we use the \W escape (shorthand for the character class [^[:alnum:]_] as per the manual) to exclude "symbols" but not accented characters:

regress=# SELECT regexp_replace(lower(x),'\W','','g') 
          FROM ( VALUES ('$s^o&f!t'),('Café') ) vals(x);
 regexp_replace 
----------------
 soft
 café
(2 rows)

If you want to filter out accented chars too you can define your own character class:

regress=# SELECT regexp_replace(lower(x),'[^a-z0-9]','','g')
          FROM ( VALUES ('$s^o&f!t'),('Café') ) vals(x);
 regexp_replace 
----------------
 soft
 caf
(2 rows)

If you actually intended to substitute some accented characters for similar unaccented characters, you could use translate as per this wiki article:

regress=# SELECT translate(
        lower(x),
        'âãäåāăąÁÂÃÄÅĀĂĄèééêëēĕėęěĒĔĖĘĚìíîïìĩīĭÌÍÎÏÌĨĪĬóôõöōŏőÒÓÔÕÖŌŎŐùúûüũūŭůÙÚÛÜŨŪŬŮ',
        'aaaaaaaaaaaaaaaeeeeeeeeeeeeeeeiiiiiiiiiiiiiiiiooooooooooooooouuuuuuuuuuuuuuuu'
    )
    FROM ( VALUES ('$s^o&f!t'),('Café') ) vals(x);

 translate 
-----------
 $s^o&f!t
 cafe
(2 rows)

i used php regular expression stored it to a variable and did it!!! it got a length process!!! anyways thanks mate... will accept ur answer!!! — Ghostman, Oct 23 '12 at 13:23

Find possible duplicates in two columns ignoring case and special characters

Query

3 Answers3

`lower()`/ `upper()`

`unaccent()`

If you must attempt to "un-accent", "normalize" accents or "strip" accents:

Linked

Find possible duplicates in two columns ignoring case and special characters

Query

3 Answers3

lower()/ upper()

unaccent()

If you must attempt to "un-accent", "normalize" accents or "strip" accents:

Linked

`lower()`/ `upper()`

`unaccent()`