1

My team and I have been using crate for one of our projects over the passed few years. We have a table with hundreds of millions of records and performance is key.

As we've developed more and more features on this project, we've ran into interesting problem. We have a column on this table labeled 'persist_date' which is when the record actually got persisted into the table. These dates may not always align and we could have a start_date of 2021-06-21 with a persist_date of 2021-10-14.

All of our queries up this point have easily been able to add a partition against start_date. Now we are encountering a problem which requires us to use a non-partitioned column (persist_date) to query against.

As I understand it, crateDB is really performant but only when you query against 1 specific partition at a time. My question now is how would I go about creating a partition for this other date column without duplicated my data? Is there anything other than a partition that might help, like the way the table is clustered?

  • 1
    May I ask how big your individual shards / partitions are? It is correct, that CrateDB will try to limit queries to partitions / shards that could hold data, i.e. the partition values are used to filter out shards. However queries would be done in a highly parallelized way anyway and also selections on non partition columns should be fast. – proddata Oct 21 '21 at 06:26
  • 1
    Our partitions are currently by week on 'start_date'. With a search over a few weeks on persist_date (non partitioned column), I was seeing queries take upwards of 90 seconds, with most around 30 seconds. With a partition in place or smaller date range queried, I get results back in under a second. – Nate Elliott Oct 21 '21 at 15:59

1 Answers1

2

You could use both columns as partition values. e.g.

CREATE TABLE two_parted (a TEXT, b TEXT, val DOUBLE) PARTITIONED BY (a,b);

If either a or b are used in a selection, this would limit queries to shards that have either value. However this could lead to more shards, so you might want to partitions not on a daily, but weekly or monthly basis.

proddata
  • 216
  • 1
  • 7
  • Thanks for quick answer. I do have some concerns with this solution though. This would double the shard count. I'm not sure doubling shard count is something we would like to do. Per docs, I found 'If the partition columns are badly chosen you can end up with too many shards in the cluster, affecting the overall stability and performance negatively.' This is a bit worry some. Is there additional documentation on partitions around multiple column partitions? – Nate Elliott Oct 21 '21 at 16:08