0

I have a table called geo_location which keeps public ip range in the city( in integer form). I have to query an ip in the table and check if it lies between start_range and end_range and then return the corresponding row. The table has 2929393 rows.

Here is my table descriptin:

CREATE TABLE acrs.geo_location (
    start_range bigint,
    end_range bigint,
    city text,
    country text,
    country_init text,
    latitude double,
    longitude double,
    state text,
    PRIMARY KEY (start_range, end_range)
) WITH CLUSTERING ORDER BY (end_range ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';

And this is the example data in the table:

@ Row 1
--------------+---------------------------------
 start_range  | 3753640192
 end_range    | 3753640447
 city         | Tonk
 country      | India
 country_init | IN
 latitude     | 26.16667
 longitude    | 75.78333
 state        | Rajasthan

@ Row 2
--------------+---------------------------------
 start_range  | 1358168576
 end_range    | 1358171135
 city         | Kent
 country      | United Kingdom
 country_init | GB
 latitude     | 51.25
 longitude    | 0.75
 state        | England

Now when I query:

select * from geo_location where start_range < 2534358817 and end_range > 2534358817 ALLOW FILTERING ;

it takes a lot of time before returning the resultant row. It's possible that my data model is not right since I have recently shifted from relational databases to this, hence looking for some help regarding the same. Thanks.

Devanshu Misra
  • 773
  • 1
  • 9
  • 28

2 Answers2

3

Your DB structure is most likely wrong when you have to use ALLOW FILTERING.

In cassandra it is not possible to use ranges queries on partition keys. What you need to do is modify your table structure to have a partition key which clusters your data (for example by date) and clustering keys start_range , end_range.

See: Difference between partition key, composite key and clustering key in Cassandra?

Alex Tbk
  • 2,042
  • 2
  • 20
  • 38
  • That is the problem I suppose. But in my case, I don't have any such key(unless I create one myself) like date which I can use as partition key. – Devanshu Misra Jan 08 '20 at 10:08
  • 2
    You need to think about an artificial key -for example somethng that you derive from the start_range. You need to select this key in such manner, that the partitions do not exceed 100MB , which is a pretty good rule of the thumb – Alex Tbk Jan 08 '20 at 10:59
  • you could also use the insert time to forumlate date/time like `now()` will give you a timeuuid. That can help towards making a key unique. – markc Jan 08 '20 at 14:13
  • @AlexTbk I've been pondering over your idea of using an artificial key, but to do that, I'll probably have to use `city`, `state` or `country` as a partition since no other (`start_range` or `end_range`) data can be modified to make such keys. But the problem here is, while fetching the data I'll be querying `start_range` or `end_range` to get the city, state or country of the range provided. – Devanshu Misra Jan 09 '20 at 09:33
  • @markc That would surely make my row unique but it won't make my result fetch any faster. That's what my best guess is – Devanshu Misra Jan 09 '20 at 09:34
  • Generally speaking as the poster above says this is a data modelling issue. You either denormalize your data for performance or keep it all in one table but at the mercy of performance penalties. – markc Jan 09 '20 at 14:11
0

Ideally, you want to reduce the number of rows searched. One option is to create and populate another table range_by_country (country, state, start_range, end_range), you first query this table to find the country and state the ip belongs to.

Next, modify your initial table's primary key to:

PRIMARY KEY ((country, state), start_range, end_range)

Now add the country and state to your query. This should greatly increase performance since we are searching all entries in a state rather than all entries of the database.

Leo103
  • 691
  • 7
  • 16