4

I have a list of non-unique strings:

list = ["a", "b", "c", "a", "a", "d", "b"]

I would like to replace each element with an integer key which uniquely identifies each string:

list = [0, 1, 2, 0, 0, 3, 1]

The number does not matter, as long as it is a unique identifier.

So far all I can think to do is copy the list to a set, and use the index of the set to reference the list. I'm sure there's a better way though.

Rachie
  • Are all of the "strings" single characters as you have here? If so, you could consider using the [ord](https://docs.python.org/2/library/functions.html#ord) function. [Sets](https://docs.python.org/2/library/sets.html) do not support indexing. – rkersh Jun 02 '16 at 22:36
  • Not necessarily, no. – Rachie Jun 02 '16 at 22:38
  • BTW, don't use `list` as a variable name, as that shadows the built-in `list` type. It won't hurt anything here, but it can lead to mysterious bugs if your script later tries to use the `list` type to construct a list. – PM 2Ring Jun 02 '16 at 22:55

5 Answers

10

This guarantees uniqueness and that the IDs are contiguous starting from 0 (because sets are unordered, the IDs will not necessarily follow the order of first appearance):

id_s = {c: i for i, c in enumerate(set(list))}
li = [id_s[c] for c in list]

On a different note, you should not use `list` as a variable name because it shadows the built-in type `list`.
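
If the IDs should follow the order in which each string first appears, one possible variation is sketched below. It assumes Python 3.7+, where dicts preserve insertion order; the names `encode`, `words` and `encoded` are made up for this illustration and are not part of the answer above.

```python
def encode(items):
    """Map each distinct item to an integer ID, numbered by first appearance."""
    # dict.fromkeys keeps one key per distinct item, in insertion order
    # (guaranteed from Python 3.7 onwards).
    ids = {item: i for i, item in enumerate(dict.fromkeys(items))}
    return [ids[item] for item in items]


words = ["a", "b", "c", "a", "a", "d", "b"]
encoded = encode(words)
print(encoded)  # [0, 1, 2, 0, 0, 3, 1]
```

Unlike the `set` version, this gives the same numbering on every run for the same input.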

user2390182
5

Here's a single-pass solution with defaultdict:

from collections import defaultdict
seen = defaultdict()
seen.default_factory = lambda: len(seen)  # you could instead bind to seen.__len__

In [11]: [seen[c] for c in list]
Out[11]: [0, 1, 2, 0, 0, 3, 1]

It's kind of a trick but worth mentioning!


An alternative, suggested by @user2357112 in a related question/answer, is to draw the IDs from itertools.count, which lets you set everything up in the constructor:

from itertools import count
seen = defaultdict(count().__next__)  # .next in python 2

This may be preferable, as the default_factory then does not need to look up seen in the global scope.
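
To keep the counter out of module scope and avoid the `.next` / `.__next__` difference between Python 2 and 3, the same idea can be wrapped in a small function. This is only a sketch; `make_id_map`, `words` and `encoded` are hypothetical names chosen for the example.

```python
from collections import defaultdict
from itertools import count


def make_id_map():
    """Return a dict that assigns 0, 1, 2, ... to each previously unseen key."""
    counter = count()  # local to this call, so nothing leaks into module scope
    # next(counter) is spelled the same way in Python 2.6+ and Python 3
    return defaultdict(lambda: next(counter))


words = ["a", "b", "c", "a", "a", "d", "b"]
id_map = make_id_map()
encoded = [id_map[w] for w in words]
print(encoded)       # [0, 1, 2, 0, 0, 3, 1]
print(dict(id_map))  # {'a': 0, 'b': 1, 'c': 2, 'd': 3}
```

Each call to `make_id_map()` starts a fresh numbering, so separate lists can be encoded independently.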

Andy Hayden
  • Very clever, I like it! I had never thought about using that kind of reflexive powers in the `default_factory`. – user2390182 Jun 02 '16 at 23:07
  • @schwobaseggl I *guess* that's what the attribute is there for (rather than being private), still I had hoped there'd be a single constructor way to do it (and reference self)... it feels a little dirty/old school. :/ – Andy Hayden Jun 02 '16 at 23:11
  • [`itertools.count().next` also works](http://stackoverflow.com/questions/18605500/assign-strings-to-ids-in-python/18605520#18605520) for the `default_factory`, or you could use `seen = defaultdict(lambda: len(seen))`, since `seen` doesn't need to exist yet to create the lambda. I prefer `itertools.count().next` to `lambda: len(seen)`, since it doesn't require inspecting the dict's state in the middle of a mutative operation, but either version feels like there's too much magic going on in the `default_factory`. – user2357112 Jun 02 '16 at 23:30
  • @user2357112 I don't think it's too much magic, that's what it is there for! It's annoying that the itertools.count API is different for Python 3 (you need to use `__next__`) but I agree that itertools.count is much nicer than len (though both are O(1)). – Andy Hayden Jun 02 '16 at 23:37
  • @user2357112 I missed the lambda part... the worst part is that it looks up the `seen` variable in scope (which could dirtily be avoided by binding to `seen.__len__`, if only [len were a proper oo method](http://stackoverflow.com/questions/237128/is-there-a-reason-python-strings-dont-have-a-string-length-method#comment9848314_237150)). This really needs to be created in a function to avoid that. Your solution is better! – Andy Hayden Jun 03 '16 at 01:08
  • to avoid the dichotomy between Python 2 and 3, consider `defaultdict(lambda: next(count))` where `count = itertools.count()`. – Adam Jun 05 '16 at 01:00
  • @codesparkle and if you do that consider defining it in a function (so that the count variable doesn't leak, like the seen variable above). – Andy Hayden Jun 05 '16 at 01:17
4
>>> lst = ["a", "b", "c", "a", "a", "d", "b"]
>>> nums = [ord(x) for x in lst]
>>> print(nums)
[97, 98, 99, 97, 97, 100, 98]
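
Note that `ord` only accepts a single character, so this approach does not carry over to longer strings. A small sketch of that limitation follows; `ord_ids` is a hypothetical helper named for this example only.

```python
def ord_ids(strings):
    """Return ord() of each string; only valid for one-character strings."""
    return [ord(s) for s in strings]


print(ord_ids(["a", "b", "a"]))  # [97, 98, 97]

try:
    ord_ids(["abc"])  # ord() raises TypeError for strings longer than 1
except TypeError as exc:
    print("ord() rejected a multi-character string:", exc)
```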
Tonechas
Chris
2

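If you are not picky about the values, then use the hash function: it returns an integer, and equal strings always get the same hash within one interpreter run. The values are large and not contiguous, and in Python 3 string hashes change between runs unless PYTHONHASHSEED is fixed. Either of the two lines below does the job:

li = ["a", "b", "c", "a", "a", "d", "b"]
ids = list(map(hash, li))          # Turn list of strings into list of ints
ids = [hash(item) for item in li]  # Same as above, as a list comprehension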
Hai Vu
1

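A functional approach (each string is mapped to the index of its last occurrence, so the IDs are unique but not contiguous):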

l = ["a", "b", "c", "a", "a", "d", "b", "abc", "def", "abc"]
from itertools import count
from operator import itemgetter

mapped = itemgetter(*l)(dict(zip(l, count())))
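
To make the behaviour of the one-liner concrete, here is a sketch that splits it into steps and prints the result. The intermediate names `table` and `safe` are introduced only for this illustration. It also shows one possible variant that avoids `itemgetter`'s edge cases (with a single key it returns a bare value rather than a tuple, and with no keys it raises `TypeError`).

```python
from itertools import count
from operator import itemgetter

l = ["a", "b", "c", "a", "a", "d", "b", "abc", "def", "abc"]

# Later duplicates overwrite earlier ones, so each string ends up
# mapped to the index of its last occurrence.
table = dict(zip(l, count()))

mapped = itemgetter(*l)(table)
print(mapped)  # (4, 6, 2, 4, 4, 5, 6, 9, 8, 9)

# Variant that always returns a tuple, even for lists of length 0 or 1.
safe = tuple(map(table.__getitem__, l))
print(safe == mapped)  # True
```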

You could also use a simple generator function:

from itertools import count

def uniq_ident(l):
    cn, d = count(), {}
    for ele in l:
        if ele not in d:
            c = next(cn)
            d[ele] = c
            yield c
        else:
            yield d[ele]


In [35]: l = ["a", "b", "c", "a", "a", "d", "b"]

In [36]: list(uniq_ident(l))
Out[36]: [0, 1, 2, 0, 0, 3, 1]
Padraic Cunningham