What is the fastest way to find duplicate texts in a table, i.e., rows whose value in one text column occurs at least twice across the whole table? The table contains over 160 million rows.
I have a table with the columns id, maintext, and maintext_token, where the latter was created with to_tsvector(maintext). Moreover, I created a GIN index on maintext_token, i.e., create index idx_maintext_tokens on tablename using gin(maintext_token);
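For reference, the setup looks roughly like this (the column types are my assumption; only the column names and the index statement are exactly as stated above):

create table tablename (
    id             bigint,    -- assumed type
    maintext       text,      -- the column I want to deduplicate
    maintext_token tsvector   -- populated via to_tsvector(maintext)
);
create index idx_maintext_tokens on tablename using gin(maintext_token);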
Currently I'm using the following query, which, however, takes a rather long time:
select maintext, count(maintext)
from ccnc
group by maintext
having count(maintext)>1
order by maintext;
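To see where the time goes, the plan can be inspected like this (just a sketch of the diagnostic, run against the same table):

-- show the actual plan, timing, and I/O for the duplicate search
explain (analyze, buffers)
select maintext, count(maintext)
from ccnc
group by maintext
having count(maintext) > 1
order by maintext;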
I also tried the same operation, but using the maintext_token column instead of maintext for the comparison:
select maintext_token, count(maintext_token)
from ccnc
group by maintext_token
having count(maintext_token)>1
order by maintext_token;
Both queries run for a long time, even though I expected at least the second one to be much faster, since I assumed Postgres could use the GIN index for the comparison.
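For completeness, one workaround I've been considering (untested at this scale, just a sketch): grouping on a fixed-size hash of the text instead of the text itself, backed by a b-tree expression index, since a plain b-tree index on maintext would likely fail for long texts due to the index row size limit:

-- b-tree expression index on a 32-character digest of the text
create index idx_maintext_md5 on ccnc (md5(maintext));

-- digests occurring more than once identify the duplicate texts
select md5(maintext) as digest, count(*)
from ccnc
group by md5(maintext)
having count(*) > 1;

I'm not sure whether the planner would actually use that index for the grouping, though.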
Thank you in advance for any insights! Cheers :)