
I am struggling with ClickHouse to keep a unique data row per PK.

I chose this column-oriented DB to serve statistics data quickly and am very satisfied with its speed. However, I have run into a duplicated-data issue.

The test table looks like this:

CREATE TABLE test2 (
    `uid`                String COMMENT 'User ID',
    `name`              String COMMENT 'name'
) ENGINE ReplacingMergeTree
ORDER BY uid
PRIMARY KEY uid;

Let's presume I am going to join against this table to display names (the name field in this table). However, I can insert as many rows as I want with the same PK (sorting key).

For example:

INSERT INTO test2
(uid, name) VALUES ('1', 'User1');
INSERT INTO test2
(uid, name) VALUES ('1', 'User2');
INSERT INTO test2
(uid, name) VALUES ('1', 'User3');
SELECT * FROM test2 WHERE uid = '1';

Now I can see 3 rows with the same sorting key. Is there any way to make the key unique, or at least to prevent an insert when the key already exists?

Let's think about the scenario below.

The tables and data are:

CREATE TABLE blog (
    `blog_id` String,
    `blog_writer` String
) ENGINE MergeTree
ORDER BY tuple();


CREATE TABLE statistics (
    `date` UInt32,
    `blog_id` String,
    `read_cnt` UInt32,
    `like_cnt` UInt32
) ENGINE MergeTree
ORDER BY tuple();


INSERT INTO blog (blog_id, blog_writer) VALUES ('1', 'name1');
INSERT INTO blog (blog_id, blog_writer) VALUES ('2', 'name2');

INSERT INTO statistics(date, blog_id, read_cnt, like_cnt) VALUES (202007, '1', 10, 20);
INSERT INTO statistics(date, blog_id, read_cnt, like_cnt) VALUES (202008, '1', 20, 0);
INSERT INTO statistics(date, blog_id, read_cnt, like_cnt) VALUES (202009, '1', 3, 1);
INSERT INTO statistics(date, blog_id, read_cnt, like_cnt) VALUES (202008, '2', 11, 2);

And here is the summing query:

SELECT
    b.writer,
    a.read_sum,
    a.like_sum
FROM
     (
         SELECT
            blog_id,
            SUM(read_cnt) as read_sum,
            SUM(like_cnt) as like_sum
         FROM statistics
         GROUP BY blog_id
     ) a JOIN
     (
         SELECT blog_id, blog_writer as writer FROM blog
     ) b
    ON a.blog_id = b.blog_id;

At this moment it works fine, but if a new row comes in like

INSERT INTO statistics(date, blog_id, read_cnt, like_cnt) VALUES (202008, '1', 60, 0);

What I expected is for the row to be updated, so that the read_sum for "name1" becomes 73 (10 + 60 + 3). But it shows 93, since the duplicated insert is allowed.

Is there any way to

  • prevent duplicated inserts,
  • or define a PK with a uniqueness guarantee on the table?

Thanks.

Juneyoung Oh
  • consider using the [FINAL modifier](https://clickhouse.tech/docs/en/sql-reference/statements/select/from/#select-from-final) in the FROM clause (see the sketch after these comments) – vladimir Aug 19 '20 at 11:01
  • at this moment, *eventual* deduplication is provided by [Replicated*-engines](https://stackoverflow.com/a/53702158/303298) – vladimir Aug 19 '20 at 11:11
  • @vladimir Thanks for the comment. You may add an answer post so I can mark yours as the answer and close. – Juneyoung Oh Aug 24 '20 at 01:12
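
To illustrate the FINAL modifier suggested in the first comment, a minimal sketch, assuming test2 is the ReplacingMergeTree table above. FINAL merges rows at query time, so only the latest row per sorting key is returned, at some extra query cost:

SELECT * FROM test2 FINAL WHERE uid = '1';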

3 Answers


One thing that comes to mind is ReplacingMergeTree. It won't guarantee absence of duplication right away, but it will do so eventually. As the docs state:

Data deduplication occurs only during a merge. Merging occurs in the background at an unknown time, so you can’t plan for it. Some of the data may remain unprocessed.
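
If you cannot wait for the background merge, one option (a sketch, assuming test2 is the ReplacingMergeTree table from the question; it forces an unscheduled merge, so use it sparingly) is to trigger deduplication explicitly:

OPTIMIZE TABLE test2 FINAL;

After the merge finishes, only one row per sorting key remains.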

Another approach that I personally use is introducing another column named, say, _ts: a timestamp of when the row was inserted. This lets you track changes, and with the help of ClickHouse's beautiful LIMIT BY you can easily get the last version of a row for a given PK.

CREATE TABLE test2 (
    `uid`               String COMMENT 'User ID',
    `name`              String COMMENT 'name',
    `_ts`               DateTime
) ENGINE MergeTree
ORDER BY uid;

The select would look like this:

SELECT uid, name FROM test2 ORDER BY _ts DESC LIMIT 1 BY uid;

In fact, you don't need a PK at all: just list in LIMIT BY whichever column(s) the rows should be unique by.
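
For completeness, a hypothetical way to populate _ts on insert, assuming inserts arrive in wall-clock order so the newest timestamp wins:

INSERT INTO test2 (uid, name, _ts) VALUES ('1', 'User1', now());
INSERT INTO test2 (uid, name, _ts) VALUES ('1', 'User2', now());
-- the LIMIT BY query above now returns only ('1', 'User2')

Declaring the column as _ts DateTime DEFAULT now() would also let you omit it from inserts entirely.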

go2nirvana

Besides ReplacingMergeTree, which runs deduplication asynchronously (so you can temporarily have duplicated rows with the same PK), you can use CollapsingMergeTree or VersionedCollapsingMergeTree.

With CollapsingMergeTree you could do something like this:

CREATE TABLE statistics (
    `date` UInt32,
    `blog_id` String,
    `read_cnt` UInt32,
    `like_cnt` UInt32,
    `sign` Int8
) ENGINE CollapsingMergeTree(sign)
-- rows collapse per sorting key, so it must identify one logical row;
-- a separate PRIMARY KEY may only be a prefix of ORDER BY
ORDER BY (blog_id, date);

The only caveat is that on every insert of a duplicated PK you have to cancel the previous row, something like this:

-- first insert
INSERT INTO statistics (date, blog_id, read_cnt, like_cnt, sign) VALUES (202008, '1', 20, 0, 1);
-- cancel the previous insert, then insert the new one
INSERT INTO statistics (date, blog_id, read_cnt, like_cnt, sign) VALUES (202008, '1', 20, 0, -1);
INSERT INTO statistics (date, blog_id, read_cnt, like_cnt, sign) VALUES (202008, '1', 11, 2, 1);
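
Until the background merge actually collapses the +1/-1 pairs, reads have to account for sign themselves. A sketch of the usual aggregation pattern for CollapsingMergeTree:

SELECT
    blog_id,
    SUM(read_cnt * sign) AS read_sum,
    SUM(like_cnt * sign) AS like_sum
FROM statistics
GROUP BY blog_id
HAVING SUM(sign) > 0;

Cancelled rows contribute +x and -x and net out to zero, so the totals match what a fully collapsed table would return.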
alrocar

I do not think this is a real solution to the problem, but at least I work around it this way from a business perspective.

Since ClickHouse does not officially support modification of table data (it provides ALTER TABLE ... UPDATE | DELETE, but those eventually rewrite the table), I split the table into many small partitions (in my case, one partition holds about 50,000 rows), and when duplicated data comes in, I 1) drop the partition and 2) re-insert the data. In the case above, I always execute an ALTER TABLE ... DROP PARTITION statement before the insert.
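
A sketch of this workaround, assuming the statistics table is partitioned by the month column date (the partition value 202008 below matches the example data; adjust names to your schema):

CREATE TABLE statistics (
    `date` UInt32,
    `blog_id` String,
    `read_cnt` UInt32,
    `like_cnt` UInt32
) ENGINE MergeTree
PARTITION BY date
ORDER BY blog_id;

-- when corrected rows for 202008 arrive: drop the whole month, then re-insert it
ALTER TABLE statistics DROP PARTITION 202008;
INSERT INTO statistics (date, blog_id, read_cnt, like_cnt) VALUES (202008, '1', 60, 0);
INSERT INTO statistics (date, blog_id, read_cnt, like_cnt) VALUES (202008, '2', 11, 2);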

I also tried ReplacingMergeTree, but data duplication still occurred (maybe I do not understand how to use the table, but I gave it a single sorting key, and when I insert duplicated data there are multiple rows with the same sorting key).

Juneyoung Oh