2

Requirement: I have a big set of unique strings. I need to assign an unique int id to each of them. After that, either get id from string or get string from id should be efficient enough (both memory and speed).

If in c/c++, one can store these strings in a hash table (an array of const char * for example), and assign the index of a string in the table as the id.

Is it possible do the same thing or other solution in python? Otherwise I need to maintain two dict that map strings to ids and vice vasa.

Update: the set is frozend, no need to change.

jayven
  • 770
  • 8
  • 19

2 Answers2

1

If only string -> id is enough, just use hash function:

In [2]: hash( 'hello' )
Out[2]: 840651671246116861

In [3]: hash( 'helloo' )
Out[3]: -827725961091893887

If you need both ways, as @njzk2 suggested:

values = {hash(value): value for value in string_list}
# from id -> string:
values[id]
# from string -> id:
hash(string)

If you are cautious for collisions in hashing and your data is static, you can check if there are any collisions:

hashes = set()
for value in string_list:
   hashed = hash(value)
   if hashed in hashes:
      print('at least one collision in hashing')
      break
   hashes.add(hashed)
print('no collisions at hashing')

If you have any collisions, which is very unlikely, you can do:

myDict1 = {} # string --> id dictonary
myDict2 = {} # id --> string dictionary

counter = 0
for value in string_list:
   myDict1[value] = counter
   myDict2[counter] = value
   counter += 1
Sait
  • 19,045
  • 18
  • 72
  • 99
  • 1
    why would you need 2 dictionnaries? `values = {hash(value): value for value in string_list}` is enough. (since, as you pointed out, you can always obtain the hash from the string) – njzk2 Jul 09 '15 at 04:04
  • 1
    also, as per the question `I have a big set of unique strings` – njzk2 Jul 09 '15 at 04:05
  • @njzk2 Good catch! Do you want to add a new answer, or should I update mine? – Sait Jul 09 '15 at 04:05
  • 1
    But hash(string) maybe collide for two different strings. – jayven Jul 09 '15 at 04:43
  • This is great if you don't need to insert and delete. Otherwise, it's best to use the suggested two-way-reverse-map in the questions comment. – Ross Jul 09 '15 at 05:09
  • @Ross, yes, the data set is frozend, no need to change. – jayven Jul 09 '15 at 05:14
  • @Sait using two dict is the final choice if there is no other more memory efficient solution :) – jayven Jul 09 '15 at 05:18
  • @jayven: I updated my answer for checking whether you have any collisions in hashing. I would go with hashing if you don't have any collisions since you are looking for the best memory-efficient way. – Sait Jul 09 '15 at 05:30
  • @Sait ha! I think I figure out how to handle the collision condition: store data as `{probe(hash(s)): s}` , the probe function is to avoid collision, for example a linear probe function: if the hash value is already taken by another string, then check value +1 and so on. – jayven Jul 09 '15 at 05:50
  • @jayven Are you sure you will be able to `recover` the same thing again? I doubt it. If you think so, please post another answer and answer your own question in more detail. – Sait Jul 09 '15 at 05:54
  • hi @Sai, i've posted my answer, is it ok? – jayven Jul 09 '15 at 06:53
1

Here is my solution now, using only one dict.

class StringTable(object):

    def __init__(self):
        self.table = {}

    def probe_id(self, ss):
        table = self.table
        id_ = hash(ss)

        while True:
            s = table.get(id_, None)

            # id_ not use, return it
            if s is None:
                return id_, False

            # id_ used and equal to ss
            if ss == s:
                return id_, True

            # id_ used by other string, probe again!
            id_ = id_ + 1

            # don't use this, endless loop, why?
            #id_ = hash(id_)

    def store(self, ss):
        id_, occupied = self.probe_id(ss)
        if not occupied:
            self.table[id_] = ss
            print 'store {%s: %r}' % (id_, ss)
        return id_

    def get_string_from_id(self, id_):
        ret = self.table.get(id_, None)
        print 'get_string_from_id %s -> %r' % (id_, ret)
        return ret

    def get_id_from_string(self, ss):
        id_, occupied = self.probe_id(ss)
        ret = id_ if occupied else None
        print 'get_id_from_string %r -> %s' % (ss, ret)
        return ret


st = StringTable()
# http://bugs.python.org/issue14621
s1 = "\x00\xcf\x0b\x96\x19"
s2 = "\x00\x6d\x29\x45\x18"

print repr(s1), hash(s1)
print repr(s2), hash(s2)

id1 = st.store(s1)
id2 = st.store(s2)

st.get_string_from_id(id1)
st.get_string_from_id(id2)

st.get_id_from_string(s1)
st.get_id_from_string(s2)

Output:

jayven@laptop:~/test$ python stringtable.py
'\x00\xcf\x0b\x96\x19' 1220138288
'\x00m)E\x18' 1220138288
store {1220138288: '\x00\xcf\x0b\x96\x19'}
store {1220138289: '\x00m)E\x18'}
get_string_from_id 1220138288 -> '\x00\xcf\x0b\x96\x19'
get_string_from_id 1220138289 -> '\x00m)E\x18'
get_id_from_string '\x00\xcf\x0b\x96\x19' -> 1220138288
get_id_from_string '\x00m)E\x18' -> 1220138289
jayven
  • 770
  • 8
  • 19