How to eliminate similar records from a column of a table in PostgreSQL

Question

I have a column in a table containing a list of emails. I need to keep the latest emails from the list of similar emails within the table. For eg. if the table contains elements as follows:

+--------------------------+------------------------+
|          Email           |      Received at       |
+--------------------------+------------------------+
| aespinola@aaa.com        | 2016-08-04 20:56:53+00 |
| aespinola@aaa.co         | 2016-08-04 20:56:52+00 |
| aespinola@aaa            | 2016-08-04 20:56:51+00 |
| tracy-lee.danie@lobsterink.com | 2016-08-04 10:56:53+00 |
| trac@lobsterink.com            | 2016-08-04 10:56:52+00 |
| accounts@abc.com         | 2016-08-04 06:57:32+00 |
| accounts@abc.com.au      | 2016-08-04 06:57:46+00 |
| Mahendra.chouhan@xyz.com | 2016-08-04 13:54:42+00 |
+--------------------------+------------------------+

The final output should be as follows:

+--------------------------+------------------------+
|          Email           |      Received at       |
+--------------------------+------------------------+
| aespinola@aaa.com        | 2016-08-04 20:56:53+00 |
| tracy-lee.danie@lobsterink.com | 2016-08-04 10:56:53+00 |
| accounts@abc.com.au      | 2016-08-04 06:57:46+00 |
| Mahendra.chouhan@xyz.com | 2016-08-04 13:54:42+00 |
+--------------------------+------------------------+

Using the link below, I was able to find out which emails are similar to each other. Grouping them is the next step. I couldn't figure it out.

Finding similar strings with PostgreSQL quickly

UPDATE: I have added the code which gives the similarity between emails here as requested in the comment:

CREATE EXTENSION pg_trgm;

DROP TABLE IF EXISTS roshan_email_list;

CREATE TEMPORARY TABLE roshan_email_list AS (
SELECT EXTRACT(MONTH
FROM received_at) AS MONTH, EXTRACT(YEAR
FROM received_at) AS YEAR, 
email
FROM users
group by month, year, email
);

CREATE INDEX roshan_email_list_gist ON roshan_email_list 
USING gist(email gist_trgm_ops);

SELECT set_limit(0.75);

-- The below query gives the similarity between emails 
WITH email_similarity AS 
(
SELECT similarity(n1.email, n2.email) AS sim, 
n1.email AS email, n2.email AS similar_email,
n1.month, n1.year
FROM roshan_email_list n1
JOIN roshan_email_list n2 ON n1.email <> n2.email AND n1.email % n2.email AND n1.month = n2.month AND n1.year = n2.year
WHERE n1.year = 2016 
ORDER BY sim DESC
)
SELECT e.sim, e.email, u.received_at, 
e.similar_email, e.month, e.year
FROM email_similarity e
INNER JOIN callinize.users u ON e.email = u.email;

Show us the query you use to find the similar ones (**[edit]** your question) — , Oct 24 '16 at 18:19

score 0 · Answer 1 · answered Oct 24 '16 at 18:48

0

Not sure regarding your whole dataset.

with data as (
  select 
    split_part(email, '@', 1) as first,
    split_part(split_part(email, '@', 2), '.', 1) as second,
    received_at, 
    email 
  from emails
), 
ndata as (
  select *, 
    row_number() over (partition by first, second order by received_at desc) 
  from data
)
select 
  email, received_at 
from ndata 
where row_number = 1;

answered Oct 24 '16 at 18:48

Nikhil

624
5
15

It's a good idea. But I noticed that there are some entries in my table for which that won't work. I have updated the question accordingly. – roshan shetty Oct 24 '16 at 19:13
I think you will need to use stored procedure. It doesn't seem feasible with mere queries. – Nikhil Oct 24 '16 at 20:02

How to eliminate similar records from a column of a table in PostgreSQL

1 Answers1