1

I'm trying to find the fastest solution to this simple problem. Right now I'm using a dictionary.

I have a bunch of string - int pairs that are constantly being thrown out for new ones. And I need to find the strings that match the n-highest ints (usually just 1, 2 or 3).

So the building of my data structure must be efficient, but also finding the max int and it's paired string is important.

Is a dictionary close to the best data structure? If so, should my ints be the keys or the values?

The language is python3 if it matters.

CSStudent7782
  • 618
  • 1
  • 5
  • 21
  • When you say these pairs are constantly being replaced, just how much is different? Are the strings new, or do you get the old strings with new integers ... or is it a new string with the old integer? Are the integers unique? These greatly affect the solution. – Prune Oct 19 '18 at 18:07
  • The integers are not unique. Basically I get a long unsorted list of about 200 entries, some are going to be the same, some will be different. I don't know which ones will change, so right now I throw the old one out and put in the new one. – CSStudent7782 Oct 19 '18 at 18:44

2 Answers2

1

Try a SortedList from the sortedcontainers module ("written in pure-Python, and fast as C-extensions").

At the core of Sorted Containers is the mutable sequence data type SortedList. The SortedList maintains its values in ascending sort order. As with Python’s built-in list data type, SortedList supports duplicate elements and fast random-access indexing.

Values may be added to a SortedList using either SortedList.update() or SortedList.add(). When doing so, the list remains sorted.

Because SortedList is sorted, it supports efficient lookups by value or by index.

Install the module if you don't have it:

$ pip install sortedcontainers

Store your values and strings as tuple pairs:

from sortedcontainers import SortedList

sorted_list = SortedList()

# Sample data.
sorted_list.update([
    (1, 'test'), 
    (1000, 'big number'), 
    (500, 'middle')])

>>> sorted_list[-1]
(1000, 'big number')

sorted_list.add((5000, 'even bigger'))
sorted_list.add((4000, 'big, but not biggest'))

# Get last two largest values.
>>> sorted_list[-2:]
[(4000, 'big, but not biggest'), (5000, 'even bigger')]
Community
  • 1
  • 1
Alexander
  • 105,104
  • 32
  • 201
  • 196
  • `sorted([ (1, 'test'),(1000, 'big number'),(500, 'middle'),(5000, 'even bigger'),(4000, 'big, but not biggest'),],key=lambda kv:kv[0],reverse=True)[:2]` How is it different than this? – mad_ Oct 19 '18 at 18:15
  • Per your example, you have to sort the list every time you need an answer. The answer will be the same, but the time it takes to get it will be different. Try timings with a large list. – Alexander Oct 19 '18 at 18:17
  • so you mean SortedList is using something similar to insertion sort functionality?. I didn't know about this. It's good to learn new stuff though. – mad_ Oct 19 '18 at 18:19
  • Correct, the list is stored in sorted order (which marginally adds to insertion time). Retrieval of the largest or smallest values, however, will be very fast. – Alexander Oct 19 '18 at 18:21
1

And I need to find the strings that match the n-highest ints (usually just 1, 2 or 3).

You can use heapq with a dictionary. Below is an example extracting the strings associated with the 3 highest integers. Duplicate numbers are only included until the heap is full.

import heapq
from operator import itemgetter

L = [(1, 'test'), (1000, 'big number'), (500, 'middle'), (5000, 'even bigger'),
     (4000, 'big, but not biggest'), (5000, 'another even bigger')]

d = {v: k for k, v in L}

heap = heapq.nlargest(3, d.items(), key=itemgetter(1))

res = list(map(itemgetter(0), heap))

print(res)

['even bigger', 'another even bigger', 'big, but not biggest']

As described in this answer, the solution will have O(n log n) time complexity.

jpp
  • 159,742
  • 34
  • 281
  • 339