
I have quite a large number of fixed-size records. Each record has lots of fields; ID and Value are among them. I am wondering what kind of data structure would be best so that I can

  1. locate a record by ID (unique) very fast,

  2. list the 100 records with the biggest values.

A max-heap seems to work, but it is far from perfect; do you have a smarter solution?

Thank you.

HourseArmor
  • Locating a record by ID seems irrelevant if I understand your question, because you're still going to have to scan the entire row set and sort the values to determine the top 100, no? – Chris Dec 16 '09 at 07:53
  • Well, actually these are two different requirements: 1) to be able to return a record by its ID very quickly, and 2) to be able to list the top 100 records by value – HourseArmor Dec 16 '09 at 08:05
  • Do you need to support efficient insertions? Deletes? Will the records fit into memory or will you need to go to disk? – Ants Aasma Dec 16 '09 at 08:15
  • Efficient insertion is a must; deletion doesn't happen very often, but it should be efficient too, otherwise it could affect data retrieval. Memory is not an issue; all records can fit into RAM. – HourseArmor Dec 16 '09 at 08:20

3 Answers


A hybrid data structure will most likely be best. For efficient lookup by ID, the obvious structure is a hash table. To support top-100 iteration, a max-heap or a binary search tree is a good fit. When inserting and deleting, you just do the operation on both structures. If the 100 for the iteration case is fixed, iteration happens often, and insertions/deletions aren't heavily skewed towards the top 100, just keep the top 100 as a sorted array with an overflow max-heap for the rest. That won't change the big-O complexity of the structure, but it will give a really good constant-factor speed-up for the iteration case.
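
A rough Java sketch of that last variant (the class and field names here are just illustrative, and the overflow is kept in a TreeSet rather than a raw max-heap so that deletes also stay logarithmic):

import java.util.*;

class Record {
    final int id;
    final long value;
    Record(int id, long value) { this.id = id; this.value = value; }
}

class HybridIndex {
    static final int TOP = 100;

    // Order by value, breaking ties by ID so records with equal values can coexist
    private static final Comparator<Record> BY_VALUE =
            Comparator.comparingLong((Record r) -> r.value).thenComparingInt(r -> r.id);

    private final Map<Integer, Record> byId = new HashMap<>();         // O(1) lookup by ID
    private final TreeSet<Record> top = new TreeSet<>(BY_VALUE);       // at most TOP records
    private final TreeSet<Record> overflow = new TreeSet<>(BY_VALUE);  // everything else

    void insert(Record r) {
        byId.put(r.id, r);
        top.add(r);
        if (top.size() > TOP) overflow.add(top.pollFirst()); // demote the smallest of the top set
    }

    void delete(int id) {
        Record r = byId.remove(id);
        if (r == null) return;
        if (top.remove(r)) {
            Record promoted = overflow.pollLast(); // refill the top set from the overflow
            if (promoted != null) top.add(promoted);
        } else {
            overflow.remove(r);
        }
    }

    Record findById(int id) { return byId.get(id); }

    List<Record> top100() { return new ArrayList<>(top.descendingSet()); } // biggest value first
}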

Ants Aasma
  • If the IDs are unique and dense, an array is better than a hash table; if unique and sparse, a B-tree is often used. – Svante Dec 16 '09 at 09:23

A max-heap would match the second requirement, but a hash map or a balanced search tree would be better for the first one. Make the choice based on the frequency of these operations: how often would you need to locate a single item by ID, and how often would you need to retrieve the top 100 items?

Pseudo code (written here as a small Java sketch, assuming an Item holds an id and a value):

import java.util.*;

class Item {
    final int id;
    String value;
    Item(int id, String value) { this.id = id; this.value = value; }
}

class Store {
    // Both structures hold references to the same Item instances
    private final Map<Integer, Item> hash = new HashMap<>();
    private final PriorityQueue<Item> heap =
            new PriorityQueue<>(Comparator.comparing((Item t) -> t.value).reversed());

    void add(Item t) {
        // Add the same object instance to both data structures
        heap.add(t);
        hash.put(t.id, t);
    }

    void remove(int id) {
        Item t = hash.remove(id);
        heap.remove(t); // this is going to be slow: O(n), a heap can't delete by key
    }

    List<Item> getTopN(int n) {
        // Poll from a copy so the heap itself stays intact
        PriorityQueue<Item> copy = new PriorityQueue<>(heap);
        List<Item> top = new ArrayList<>();
        for (int i = 0; i < n && !copy.isEmpty(); i++) top.add(copy.poll());
        return top;
    }

    Item getItemById(int id) {
        return hash.get(id);
    }

    void updateValue(int id, String value) {
        Item t = hash.get(id); // t is the same object referred to by the heap and hash
        heap.remove(t);        // take it out and re-add it so the heap ordering stays correct
        t.value = value;
        heap.add(t);
    }
}
Amarghosh
  • >How often: that's the problem. I will have roughly 50 requests per second, and which kind of request dominates varies over time. Allocate memory for each record, have one ID->record pointer hash map/balanced search tree to serve the first requirement, and have a value->record pointer max-heap to satisfy the second requirement? – HourseArmor Dec 16 '09 at 08:17
  • If modifications are not so frequent, I believe using a hash for ID-based retrievals and a heap for the sorted list wouldn't be a bad idea - can't say for sure without testing, though. – Amarghosh Dec 16 '09 at 09:22
  • If the hash and heap are both referring to the same object instance, you can modify values using an id-based retrieval from the hash and the changes made will be automatically reflected in the heap's references too. – Amarghosh Dec 16 '09 at 09:37
  • But won't that affect the value-based max-heap? – HourseArmor Dec 16 '09 at 09:45
  • Not if both the heap and the hash are referring to the same instance of the data object; see the updated pseudo code. – Amarghosh Dec 16 '09 at 10:20
  • There is an added overhead in terms of the speed of add/remove operations. Memory overhead would be minimal, as only one copy of the data object exists per item (both heap and hash point to the same item in memory). – Amarghosh Dec 16 '09 at 11:05

I know you want a pseudo-code algorithm, but in Java, for example, I would use a TreeSet and add all the records as ID/value pairs.

The tree will keep them sorted by value, so querying the first 100 will give you the top 100. Retrieving by ID will be straightforward.

I think the underlying structure is called a binary search tree or a balanced tree; I'm not sure.
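
A rough sketch of the value-ordered part in Java (reusing the Record class from the sketch above; the exact setup isn't spelled out in this answer):

import java.util.*;

class TreeSetSketch {
    // Records kept sorted by value, descending; the ID breaks ties so that
    // two records with equal values are both retained by the set
    static final TreeSet<Record> byValue = new TreeSet<>(
            Comparator.comparingLong((Record r) -> r.value).reversed()
                      .thenComparingInt(r -> r.id));

    // The top 100 are simply the first 100 elements in iteration order
    static List<Record> top100() {
        List<Record> result = new ArrayList<>();
        Iterator<Record> it = byValue.iterator();
        for (int i = 0; i < 100 && it.hasNext(); i++) result.add(it.next());
        return result;
    }
}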

mohdajami