
I already read this but I still have questions. I only have one VM with 16 GB of RAM, 4 cores, and a 100 GB disk, with only ClickHouse and a light web API running on it.

I'm storing leaked credentials in a database:

CREATE TABLE credential (
    user String,
    domain String,
    password String,
    first_seen Date,
    leaks Array(UInt64)
) ENGINE = ReplacingMergeTree
PARTITION BY first_seen
ORDER BY (user, domain, password, first_seen)

It sometimes happens that some credentials appear more than once (within a single file or across several).

My long-term objective is (was) the following: when inserting a credential that is already in the database, I want to keep the smaller first_seen and add the new leak id to the leaks field.

I have tried the ReplacingMergeTree engine: I inserted the same data twice and then ran OPTIMIZE TABLE credential to force the replacing engine to do its asynchronous job, as described in the documentation (exact commands below). Nothing happens; the data is still present twice in the database.
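Roughly the sequence I ran (exactly how OPTIMIZE is issued should not matter):

$ cat "data.csv" | clickhouse-client --query 'INSERT INTO credential FORMAT CSV'
$ cat "data.csv" | clickhouse-client --query 'INSERT INTO credential FORMAT CSV'
$ clickhouse-client --query 'OPTIMIZE TABLE credential'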

So I wonder:
- What did I miss with the ReplacingMergeTree engine?
- How does OPTIMIZE work, and why doesn't it do what I was expecting from it?
- Is there a real solution for avoiding duplicated data on a single instance of ClickHouse?

I have already tried to do it manually. My problem is that I have 4.5 billion records in my database, and identifying duplicates within a 100k-entry sample already takes almost 5 minutes with the following query:

SELECT DISTINCT user, domain, password, count() AS c
FROM credential
WHERE has(leaks, 0)
GROUP BY user, domain, password
HAVING c > 1

This query obviously does not work on the 4.5 billion entries, as I do not have enough RAM.

Any ideas will be tried.

Chocorean

1 Answer


Multiple things are going wrong here:

  • You partition very granularly: you should partition by something like a month of data instead. As it stands, ClickHouse has to scan lots of files.
  • You don't provide the table engine with a version. The problem here is that ClickHouse is not able to find out which row should replace the other. I suggest you use the "version" parameter of the ReplacingMergeTree, as it allows you to provide an incremental version as a number or, if that works better for you, the current DateTime (where the last DateTime always wins); see the sketch after this list.

  • You should never design your solution to require OPTIMIZE to be called to make your result sets consistent; it is not designed for this.
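
A minimal sketch of the suggested schema (the column name version, its DateTime type, and the monthly toYYYYMM partitioning are illustrative choices, not requirements):

CREATE TABLE credential (
    user String,
    domain String,
    password String,
    first_seen Date,
    leaks Array(UInt64),
    version DateTime
) ENGINE = ReplacingMergeTree(version)
PARTITION BY toYYYYMM(first_seen)
ORDER BY (user, domain, password, first_seen)

Rows sharing the same ORDER BY key are then (eventually, at merge time) collapsed into the one with the highest version.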

ClickHouse always allows you to write a query that gives you (eventual) consistency without running OPTIMIZE beforehand.

Another reason to avoid OPTIMIZE, besides it being really slow and heavy on your DB: you could end up with race conditions, where other clients of the database (or replicating ClickHouse nodes) invalidate your data between the moment OPTIMIZE finishes and the moment the SELECT is done.

Bottom line, as a solution: add a version column. When inserting rows, insert the current timestamp as the version. Then select, for each key, only the row that has the highest version in your result set, so that you do not depend on OPTIMIZE for anything other than garbage collection.
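
One way to express "highest version wins" at query time, sketched with argMax and assuming the version column from the schema above:

SELECT
    user,
    domain,
    password,
    argMax(first_seen, version) AS first_seen,
    argMax(leaks, version) AS leaks
FROM credential
GROUP BY user, domain, password

This returns, per (user, domain, password), the values from the most recently versioned row, whether or not the background merges have already deduplicated the parts.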

RoyB
  • If I define the partition key by month, it will result in one enormous partition (I have a collection of almost 1 TB of credentials to insert) while every other partition will be ridiculously small. I chose to partition by `first_seen` because I thought that even if I have more partitions, they would be better balanced. As I have awesome response times, I did not consider changing it. I did not provide a version because in this scenario both entries are literally the same. I do not have a lot of disk space, so I'm trying to reduce the quantity of information as much as possible. – Chocorean Jul 03 '19 at 10:56
  • According to the doc, "When merging, ReplacingMergeTree from all the rows with the same primary key leaves only one: - Last in the selection, if ver not set.", so I suppose ver is not mandatory and it should still delete something. And finally, I did not design this to work with OPTIMIZE, I've just discovered it while reading the doc lol – Chocorean Jul 03 '19 at 10:56
  • I'm just trying to explain my choices in order to help you understand how I was thinking, and to help you correct my mistakes! – Chocorean Jul 03 '19 at 11:10
  • Ok, clear enough; you should aim for tens to hundreds of partitions. If you end up with more than a thousand, that would be inefficient. There's documentation on that. You should wait for ClickHouse to finish deduplication, but with 1 TB of data (billions of rows?) that's going to take a while. Just give it time to merge all rows. With the version column you would at least be able to fetch only the last version of each row (by selecting on HAVING version = max(version), where you fetch the max from a join). – RoyB Jul 04 '19 at 15:22
  • Also, ideally you would involve as few partitions as possible in the result set of one query, and always mention the partition key as a filter in your WHERE conditions. This will prevent a full partition scan and make queries much faster (see the sketch after these comments). – RoyB Jul 04 '19 at 15:27
  • This is exactly how it will be used. When done, this app will be queried every day by another one to fetch every new cred since the previous day where the domain matches one of our customers' domains. I think I'm going to go with one partition per month; that sounds legit for a real application. – Chocorean Jul 09 '19 at 06:49
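
A hypothetical shape for that daily fetch, filtering on the partition key column as suggested above (the date range and the domain are made up for illustration):

SELECT user, domain, password, first_seen, leaks
FROM credential
WHERE first_seen >= '2019-07-08' AND first_seen < '2019-07-09'
  AND domain = 'example.com'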