

What is the fastest way to find duplicate texts in a table, i.e., rows whose text in one column occurs at least twice across the whole table? The table contains over 160 million rows.

I have a table that consists of the columns id, maintext, and maintext_token, where the latter was created with to_tsvector(maintext). Moreover, I created a GIN index on maintext_token, i.e., create index idx_maintext_tokens on tablename using gin(maintext_token);

Currently, I'm using the following query, which, however, takes a rather long time:

select maintext, count(maintext)
from ccnc
group by maintext
having count(maintext)>1
order by maintext;

I also tried the same operation, but using the maintext_token column instead of maintext for the comparison:

select maintext_token, count(maintext_token)
from ccnc
group by maintext_token
having count(maintext_token)>1
order by maintext_token;

Both queries run for a long time, even though I expected at least the second one to be much faster, since Postgres should be able to use the GIN index for the comparison.
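
A quick way to verify whether the planner actually touches the GIN index is to look at the execution plan. Here is a minimal sketch against the ccnc table from the queries above; note that a GIN index on a tsvector supports full-text operators such as @@, not the equality grouping done here, so a sequential scan would be the expected outcome:

-- Show the actual plan and buffer usage for the grouping query.
explain (analyze, buffers)
select maintext_token, count(maintext_token)
from ccnc
group by maintext_token
having count(maintext_token)>1;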

Thank you in advance for any insights! Cheers :)

pedjjj

1 Answer


You say that you want to test for equality, so you might want to hash the text and then search on the hashes. You can do this with a hash index, or you can index a hash of the text. I got some help recently on a related question; you'll find details and comparisons here:

Searching on expression indexes
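
For concreteness, here is a minimal sketch of the indexed-hash variant, assuming the ccnc table and maintext column from the question (md5() is built into Postgres; the index and alias names are made up for illustration):

-- Index the hash rather than the full text: hashes are small and
-- fixed-length, so building and scanning this index is far cheaper
-- than comparing 160 million long texts.
create index idx_maintext_md5 on ccnc (md5(maintext));

-- Group on the same expression so the planner can use the index,
-- then join back to fetch the duplicated texts themselves.
select c.id, c.maintext
from ccnc c
join (
    select md5(maintext) as hash
    from ccnc
    group by md5(maintext)
    having count(*) > 1
) dup on md5(c.maintext) = dup.hash
order by dup.hash, c.id;

Note that md5 collisions are theoretically possible, so if exactness matters you can re-check the candidate rows on maintext itself. A hash index (create index ... using hash (maintext)) only speeds up equality lookups of a known value, not a grouping scan like this one.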

Morris de Oryx