Data structure for range query

Question

I was recently asked a coding question on the below problem. I have some solution to this problem but I am not very sure if those are most efficient.

Problem:

Write a program to track set of text ranges. Start point and end point will be string.

Text range example : [AbA-Ef]
 Aa would fall before this range
 AB would fall inside this range
 etc.

String comparison would be like 'A' < 'a' < 'B' < 'b' ... 'Z' < 'z'

We need to support following operations on this range

Add range - this should merge the ranges if applicable
Delete range - this deletes range from tracked ranges and recompute the ranges
Query range - Given a character, function should return whether it is part of any of tracked ranges or not.

Note that tracked ranges can be dis-continuous.

My solutions:

I came up with two approaches.

Store ranges as doubly linked list or
Store ranges as some sort of balanced tree with leaf node having actual data and they are inter-connected as linked list.

Do you think that this solution are good enough or you can think of any better way of doing this so that those three API gives your best performance ?

Whether something is efficient or not entirely depends on the problem bounds. In this case, how many queries are there going to be? — flight, Oct 04 '12 at 05:14
This sounds a lot like homework... also it would help if you could give some examples, as stated the question is not very clear to me. As general advice though, write a brute force, slow, but working implementation first. This will become your test. See where that test is slow, and fix that. — starmole, Oct 04 '12 at 05:19
This is not homework question. As already noted in question, this question was asked in coding test for interview. — Lance Reynolds, Oct 04 '12 at 05:27

amit · Answer 1 · 2012-10-04T07:16:27.337

12

You are probably looking for an interval tree.

Use the data structure with your custom comparator to indicate "What's on range", and you will be able to do the required operations efficiently.

Note, an interval tree is actually an efficient way to implement your 2nd idea (Store ranges as a some sort of balanced tree)

edited Oct 04 '12 at 07:16

answered Oct 04 '12 at 06:32

amit

175,853
27
231
333

score 1 · Answer 2 · answered Oct 04 '12 at 05:20

I'm not clear on what the "delete range" operation is supposed to do. Does it;

Delete a previously inserted range, and recompute the merge of the remaining ranges?
Stop tracking the deleted range, regardless of how many times parts of it have been added.

That doesn't make a huge difference algorithmically; it's just bookkeeping. But it's important to clarify. Also, are the ranges closed or half-open? (Another detail which doesn't affect the algorithm but does affect the implementation).

The basic approach to this problem is to merge the tracked set into a sorted list of disjoint (non-overlapping) ranges; either as a vector or a binary search tree, or basically any structure which supports O(log n) searching.

One approach is to put both endpoints of every disjoint range into the datastructure. To find out if a target value is in a range, find the index of the smallest endpoint greater than the target. If the index is odd the target is in some range; even means it's outside.

Alternatively, index all the disjoint ranges by their start points; find the target by searching for the largest start-point not greater than the target, and then compare the target with the associated end-point.

I usually use the first approach with sorted vectors, which are plausible if (a) space utilization is important and (b) insert and merge are relatively rare. With binary search trees, I go for the second approach. But they differ only in details and constants.

Merging and deleting are not difficult, but there are an annoying number of cases. You start by finding the ranges corresponding to the endpoints of the range to be inserted/deleted (using the standard find operation), remove all the ranges in between the two, and fiddle with the endpoints to correct the partially overlapping ranges. While the find operation is always O(log n), the tree/vector manipulation is o(n) (if the inserted/deleted range is large, anyway).

Hi, Thanks for your answer. Delete range should basically re-compute the range. Eg. If we delete [C-D] from [A-G] then it should give two ranges -> [A-B] and [E-G] — Lance Reynolds, Oct 04 '12 at 05:25
That's the simplest case; my answer basically stands, then. You should clarify whether you're talking about strings or characters, though: your first statements are string ranges. Or do you mean [A-G] to mean "all strings starting with a letter between A and G, no matter how long"? (so that "Gz" is part of the range.) Again, this is just an implementation detail. More interesting case: I'm tracking [A-C][E-F][H-K] and I delete [B-I]. Now I'm left with [A][J-K]. Suppose instead I insert [B-I]. Now I've got [A-K]. Hence my statement that ranges inside the insert/delete can be removed. — rici, Oct 04 '12 at 05:32

score 0 · Answer 3 · answered Oct 04 '12 at 05:35

Most languages, including Java and C++, have a some sort of ordered map or ordered set in which you can both look up a value and find the next value after or the first value before a value. You could use this as a building block - If it contains a set of disjoint ranges then it will have a least element of a range followed by a greatest element of a range followed by the least element of a range followed by the greatest element of a range and so on. When you add a range you can check to see if you have preserved this property. If not, you need to merge ranges. Similarly, you want to preserve this when you delete. Then you can query by just looking to see if there is a least element just before your query point and a greatest element just after.

If you want to create your own datastructure from scratch, I would think about some sort of radix trie structure, because this avoids doing lots of repeated string comparisons.

score 0 · Answer 4 · answered Aug 01 '16 at 03:08

I think you would go for B+ tree it's the same which you have mentioned as your second approach.

Here are some properties of B+ tree:

All data is stored leaf nodes.
Every leaf is at the same level.
All leaf nodes have links to other leaf nodes.

Here are few applications B+ tree:

It reduces the number of I/O operations required to find an element in the tree.
Often used in the implementation of database indexes.
The primary value of a B+ tree is in storing data for efficient retrieval in a block-oriented storage context — in particular, file systems.
NTFS uses B+ trees for directory indexing.

Basically it helps for range queries look ups, minimizes tree traversing.

Data structure for range query

4 Answers4

Linked