I already read this but I still have questions. I only have one VM with 16 GB of RAM, 4 cores and a disk of 100 GB, with only ClickHouse and a light web api working on it.
I'm storing leaked credentials in a database:
CREATE TABLE credential (
user String,
domain String,
password String,
first_seen Date,
leaks Array(UInt64)
) ENGINE ReplacingMergeTree
PARTITION BY first_seen
ORDER BY user, domain, password, first_seen
It something happens that some credentials appear more than once (inside a file or between many).
My long-term objective is(was) the following:
- when inserting a credential which is already in the database, I want to keep the smaller first_seen
and add the new leak id to the field leaks
.
I have tried the ReplacingMergeTree engine, insert twice the same data ($ cat "data.csv" | clickhouse-client --query 'INSERT INTO credential FORMAT CSV'
) and then performed OPTIMIZE TABLE credential
to force the replacing engine to do its asynchronous job, according to the documentation. Nothing happens, data is twice in the database.
So I wonder:
- what did i miss with the ReplacingMergeTree engine ?
- how does OPTIMIZE
work and why doesn't it do what I was expecting from it ?
- is there a real solution for avoiding replicated data on a single instance of ClickHouse ?
I have already tried to do it manually. My problem is a have 4.5 billions records into my database, and identifying duplicates inside a 100k entries sample almost takes 5 minutes with the follow query: SELECT DISTINCT user, domain, password, count() as c FROM credential WHERE has(leaks, 0) GROUP BY user, domain, password HAVING c > 1
This query obviously does not work on the 4.5b entries, as I do not have enough RAM.
Any ideas will be tried.