Data structure for large ranges of consecutive integers?

Question

Suppose you have a large range of consecutive integers in memory, each of which belongs to exactly one category. Two operations must be O(log n): moving a range from one category to another, and finding the category counts for a given range.

I'm pretty sure the second operation is solved trivially given a correct implementation for the first operation.

Each integer begins in a category, so I started with a set of balanced BSTs. Moving a subtree from one BST to another (eg, moving a range to a different category) has a runtime equivalent to merging two BSTs, which is O(n1 * n2)[1].

This is too slow (in python, and C is not an option), and I couldn't figure out a way to take advantage of my data's inherent structure to create an efficient BST merge operation.

I'm now looking at AVL, red-black, and interval trees, binary heaps, and treaps. Comparing their properties is overwhelming. Which structure should I use?

Edit (problem clarification): I am flexible on how I store these values and create my data structures. I am inflexible about how I receive my input, which comes from another application, and looks like the following: CATEGORY(cat3, I, J). My current solution creates a tree with a node for each integer in the range. This is too slow for the size of my dataset, so I'm happy to re-architect if given a better method.

Any given request can move any possible range of integers into any category. In other words, ranges are overlapping in the sense of CATEGORY(cat1, 1, 10) followed by CATEGORY(cat3, 5, 15), but non-overlapping in the sense that every integer will be in exactly one category at any given time.

Is the range saved as the list of integers or as a tuple like `(begin, end)`? — Andrea Ambu, Oct 04 '11 at 08:06
before you start implementing some trees, you should try to use the builtin datatypes like dicts and sets. These are highly optimized and very performant. — rocksportrocker, Oct 04 '11 at 08:24
So you have a list with tuples [ (number1, cat1), .... ] ??? — rocksportrocker, Oct 04 '11 at 08:25
I'm not sure I fully understand the problem, so, I'm going to ask a few questions first. How do you represent a range? As all integers in some data structure or just first+last (or first+count)? Do you ever have to split and join ranges while doing those operations? What is a category? Can it be described by some number? What does it mean to move between categories? — Alexey Frunze, Oct 04 '11 at 08:37
By moving a range from one category to another, you mean that for a range [x, y] 1) all integers x <= i <= y will be in the new category, or 2) only integers x <=i <= y *in some other category* will be in the new category? — jpalecek, Oct 04 '11 at 11:31
Not sure if editing my entry will ping the commenters. I've clarified the problem domain to specify that I currently store a node per integer, but I can change my code to store the range (eg, Node.start and Node.end). — knite, Oct 04 '11 at 17:08
Alex - A category can be represented any way you'd like. I'm currently using a Python list of integers to represent my tree nodes, like so: [LEFT RIGHT VAL CATEGORY], where LEFT and RIGHT are the left and right subtrees. @jpalecek - I receive any number of requests to move any possible range into one other category. Does my question edit answer your question? — knite, Oct 04 '11 at 17:14
How large is the range of consecutive integers you are dealing with? — Winston Ewert, Oct 05 '11 at 04:38
The link you give claiming O(n1 * n2) for merging 2 BSTs **actually gives several O(n1 + n2) answers!** Essentially: Convert each BST to a sorted list in linear time, merge these lists in linear time, and convert the result back to a BST in linear time. — j_random_hacker, Oct 05 '11 at 14:17
@j_random_hacker, he needs O(log n) time. The * is probably a typo. — Winston Ewert, Oct 05 '11 at 15:12

yo_man · Accepted Answer · 2011-10-04T18:27:19.137

As far as I understood the question you have a range [A, B] and queries of the form -

Assign a particular range to a category

E.g.
R1 R2 C1
R3 R4 C2

Query a range for the total number of items in particular categories. E.g. find count of categories in R1 R4

A simple implementation using dictionaries as given above would not work as I describe by this example -

suppose we have a range [1000, 5000]

and we make assignment as follows -

1 2 C1
2 3 C2
3 4 C3
......
4999 5000 C4999

Now we make the following assignment

1 5000 C5555

This will make updation/changes/deletion to previously assigned ranges of the ranges O(N) where N is maximum size of range (B - A)

D['category'] = set(of_all_the_ranges_you_have_in_category)

In this case deletion of previous ranges from corresponding sets in categories C1,C2...C4999 will be needed for last assignment ( 1 5000 C5555 )

OR

1 : { "stop" : 5, "category" : "C1"}, 6 : { "stop" : 19, "category" : "C23"},

Here updation of category for each starting value (1,2,3,4...,4999) will be required for last assignment ( 1 5000 C5555 )

A better option that will make updation of ranges in O(lg n) would be segment trees (http://en.wikipedia.org/wiki/Segment_tree )

For the above example the segment tree will look something like this

                   1000:5000
                      |
             ---------------------
             |                   |
           1000:3000         3001:5000
            |                    |
    ----------------      --------------------
   |               |      |                  |
 1000:2000     2001:3000   3001:4000       4001:5000

................................................................. ............................................................... and so on

The leaf nodes will have ranges (1:2, 2:3,...)

You can assign a value "category" to each node and given an interval traverse the tree dividing the interval appropriately ( e.g for 2500 to 4500 divide in 2500:3000 and 3001:4500 and proceed in both directions until nodes with matching ranges are found).

Now an interesting thing is update the children of the nodes when you need them. For example do not proceed updating the children immediately when doing assignments like 1 5000 C5555. This thing is called lazy propagation and you can learn more about it here (http://www.spoj.pl/forum/viewtopic.php?f=27&t=8296 ).

Now for query part. If the number of categories are very small, you can maintain a frequency table at each node and update the ranges when required and lazy propagate when needed otherwise, you will have to traverse the whole tree from leaf to node, counting and complexity will become O(n).

I think a better solution to querying may exist. It's not coming to my mind.

UPDATE Lets take a small example.

Range [1,8]

Categories allowed {C1, C2}

        1:8
     1:4         5:8
     1:2  3:4      5:6    7:8
 1:1 2:2 3:3 4:4  5:5 6:6 7:7 8:8

Each node will have 3 fields [category, category_counts[], children_update_required = false]

1 5 C1

Query will get divided and nodes 1:4's category will be set to C1 and children_update_required will be set to true, its children will not be updated now (remember update only when required or lazy propagation). Also node 5:5's category will be set to C1

3 4 C2

Query will propagate along the tree towards 3:4 (and in the process to reach 3:4, 1:2 and 3:4's categories will be updated to C1, 1:4's children_update_required will be set to false, 1:2's and 3:4's children_update_required will be set to true) and now will update the category of 3:4 to C2 according to current requirement. Next it will set children_update required of 3:4 to be true for future updation of its children (which is already set in this case..no harm done).

Will this work if I receive requests for different ranges? Eg, `CATEGORY(cat3, 1, 10)` and then `CATEGORY(cat1, 5, 7)`? Segment trees need to be static ranges. Yes, the number of categories is very small (<10), compared to the ranges (millions of numbers with ranges of hundreds of thousands). — knite, Oct 04 '11 at 17:29
Yup it will. The ranges will indeed remain static. You just have to modify category data associated with the corresponding node. I am adding a small example in the existing answer to clarify. I do not know if this will help but I solved a problem using segment tree in C++ (without using lazy propagation although). You could refer the update and query code there.They will change according to the requirements. Problem - [link](http://www.codechef.com/APRIL11/problems/SPREAD) Code - [link](http://www.codechef.com/viewsolution/510764) — yo_man, Oct 04 '11 at 18:20

hymloth · Answer 2 · 2011-10-05T07:55:08.413

You can have a plain python dictionary of the following form

1 : { "stop" : 5, "category" : "C1"},
6 : { "stop" : 19, "category" : "C23"},
etc

The keys here are the start of your range and the values contain the end of the range and the category this range falls into.

Because dictionaries have constant time for accessing items, you can write some code that moves a range to another category easily and efficiently: In the worst case, you'll have to restructure somehow your dictionary, if your range splits the previous ranges in two or more. For example, if you want to assign the range of (4, 8) into another category, you'll end up with:

1 : { "stop" : 3, "category" : "C1"},
4 : { "stop" : 8, "category" : "new category"},
9 : { "stop" : 19, "category" : "C23"},
etc

Finding the category count is trivial, just collect all your desired ranges in constant time and count the categories..

EDIT: To successfully find the lowest (highest) key to start performing computations/alterations, you also need a plain python list with all keys sorted, and the bisect module. This will help locating the index in the list to "put" the start of the range in O(logn) time, then everything can be done in constant time, except of the O(n) time needed to insert the new key into the list with bisect.insort_left.

How do you store two ranges starting on the same integer? I mean what happens if you have the range `(4, 8)` and the range `(4, 10)`? — Andrea Ambu, Oct 04 '11 at 09:04
@AndreaAmbu From the question: "each of which belongs to exactly one category" — rplnt, Oct 04 '11 at 09:10
@rplnt I think I'm missing the subject there, I understood it was each _range_ to belong to a particular category as the OP defines the operations in term of ranges, but in that sentence _each_ stands for _number_. Looks ambiguous to me, thanks. — Andrea Ambu, Oct 04 '11 at 09:40

score 0 · Answer 3 · answered Oct 04 '11 at 17:43

You can build a tree structure on arrays of consecutive integers quite easily, which should help with your constant factor. First renumber the sequence to start at 0, and work out which is the smallest power of two that is larger than the range of the sequence.

Now consider the following tree formed from the integers 0-7, which I can hold as four arrays, each array going horizontally.

            (all)
     0-3      4-7
   0-1 2-3  4-5   6-7
 0 1  2  3  4  5  6   7

Given a number and a level, I can find out an offset in the array for that level, just by shifting the number right an amount depending on the level.

In each element I can put a marker that either says "mixed" or gives the category for every element at or under that node of the tree. I can work out what category a node is in by following the path down from the root of the tree to a leaf, stopping as soon as I see a node that does not say "mixed". I can change the category for an interval of numbers in time lg n, because I have at most two nodes at each level to represent the category: if I had three, two of them would have the same parent and I could merge them. You may have to fiddle a bit with the edges to get neighbouring ranges correct, but I think this works in lg n time.

This is a very interesting approach! It requires much more space, but that shouldn't be an issue. I hope that both operations (category move and category count) are O(log n) as you propose. — knite, Oct 05 '11 at 04:35
The only new idea here is the tree structure, which as you say trades best-case space for simplicity. In the case where the categories are "even number" and "odd number" it may be competitive in space. For category count you will need to store summary info at tree nodes, but if you can solve this problem with any of the tree structures you mentioned you should be able to solve it with this. — mcdowella, Oct 05 '11 at 05:18

score 0 · Answer 4 · answered Oct 04 '11 at 17:48

Assumptions:

Any integer can belong to exactly one category, so ranges cannot intersect.
All integers in an incoming range belong to one category.
Sometimes you need to split a range to move a subrange to a different category.

Represent ranges by (start, end, category) tuples. Ranges don't intersect so you can build a tree of them. It's far more economical than a tree of individual integers. To order ranges (that is, nodes) you can just use the start value, for it's unique and cannot belong to another range.

If you have to move a range [a, b] to another category, you have to:

Scan your tree and update every node/range that is fully included in [a, b] range. It's a simple depth-first traversal. During traversal

If current_node.start < a or current_node.start > b, stop the search.
If current_node.start >= a and current_node.end > b, you have to split current_node in two; [current_node.start, b] will belong to a new category, the rest will be of its original category.
If current_node.start < a and current_node.end <= b, you split the other way around.

Tree search is logarithmic, and node splits are O(1).

If your tree ever gets too fragmented, you could consider merging nodes that have adjacent ranges. This will always be a parent and a child resulting either from an insert or a split; checks and joins seem to be always O(1).

score 0 · Answer 5 · answered Oct 04 '11 at 18:27

We can represent the current states as something like:

0:cat1 200:cat2 500: cat0 700:cat6 800:cat1

This means 0-200 is cat1, 200-500 is cat2, etc. We store the values in a binary search tree keyed on the starting number. Each node will also contain a dictionary mapping categories to counts for all children of that node.

Those dictionaries should make it easy to obtain the counts in O(log) time. We simply have to add the correct numbers on the paths to the start and stop of the sequence in question.

What do we do when we need set the sequence X-Y to category Z?

Determine the current category of Y O(logn )
Remove all values between X -Y O(k), but since the cost of inserting those nodes is more expensive, we can call it O(1) amortized.
Insert new node at X O(log n)
Update category count dictionaries. We should only need to update the O(log n) parents of the affected sections so this is O(log n)

Altogether this is O(log n) time.

Could you clarify #2? I agree that this is O(k), but I'm unclear as to how it can be O(1) amortized. An average value of k will be on the same order as O(n), which would make this step O(n). — knite, Oct 05 '11 at 04:30
@knite, it is amortized when considering it along with the insertion. Every removal has to been preceeded by an insertion. So spending an additional O(1) to remove it is the same as spending an additional O(1) to insert it. Insertion is already O(log n) so we can disregard it. — Winston Ewert, Oct 05 '11 at 04:37
@knite, honestly implementing complex data structures like thus tends not to work well in python. — Winston Ewert, Oct 05 '11 at 04:38

Data structure for large ranges of consecutive integers?

5 Answers5

Linked