4

I am reading a text file with Python in which the values in each column may be numeric or strings.

When those values are strings, I need to assign a unique ID to each string (unique across all the strings under the same column; the same ID must be assigned if the same string appears elsewhere under the same column).

What would be an efficient way to do it?

Bob

3 Answers

12

Use a defaultdict with a default value factory that generates new ids:

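import collections
import itertools

# Python 2 spelling; on Python 3, use itertools.count().__next__ instead.
ids = collections.defaultdict(itertools.count().next)
ids['a']  # 0
ids['b']  # 1
ids['a']  # 0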

When you look up a key in a defaultdict, if it's not already present, the defaultdict calls a user-provided default value factory to get the value and stores it before returning it.

itertools.count() creates an iterator that counts up from 0, so itertools.count().next is a bound method that produces a new integer whenever you call it.

Combined, these tools produce a dict that returns a new integer whenever you look up something you've never looked up before.
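
As a rough sketch of how this could be applied to the question (Python 3 spelling; the sample rows, the `is_numeric` helper, and the name `encode_rows` are made up for illustration), one defaultdict per column keeps the IDs independent between columns:

```python
import collections
import itertools


def is_numeric(value):
    """Return True if the text parses as a number (hypothetical helper)."""
    try:
        float(value)
    except ValueError:
        return False
    return True


def encode_rows(rows):
    """Replace each non-numeric value with an ID that is unique within its column."""
    # One ID table per column index, created on first use.
    # Each table hands out 0, 1, 2, ... for strings it has not seen before.
    column_ids = collections.defaultdict(
        lambda: collections.defaultdict(itertools.count().__next__)
    )
    encoded = []
    for row in rows:
        encoded.append([
            value if is_numeric(value) else column_ids[col][value]
            for col, value in enumerate(row)
        ])
    return encoded, column_ids


# Made-up sample data: columns 1 and 2 contain strings.
rows = [
    ["1", "red", "x"],
    ["2", "blue", "red"],
    ["3", "red", "x"],
]
encoded, column_ids = encode_rows(rows)
print(encoded)  # [['1', 0, 0], ['2', 1, 1], ['3', 0, 0]]
```

Note that "red" gets ID 0 in column 1 and ID 1 in column 2, because each column has its own counter. Using `float` as the numeric test is an assumption; it also accepts text such as "nan" or "inf", which may or may not be what the file needs.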

user2357112
  • This is not what he asked? – Burhan Khalid Sep 04 '13 at 04:54
  • @BurhanKhalid: How so? – user2357112 Sep 04 '13 at 04:55
  • This should do exactly what he needs. If he's iterating through rows of data, each unique value gets a unique integer by simple insertion. Duplicate checking is built in. – g.d.d.c Sep 04 '13 at 04:59
  • He asked to only assign values to _strings_ not each column: _"**When those values are strings**, I need to assign a unique ID of that string (unique across all the strings under the same column; the same ID must be assigned if the same string appears elsewhere under the same column)."_ He can get a running counter by just enumerating over the file. – Burhan Khalid Sep 04 '13 at 05:01
  • @BurhanKhalid: So he looks up the string in the defaultdict and gets its ID. What's the problem? EDIT: Are you looking at an old version of the answer? The first version (gone now) just had the counter, without the defaultdict, since I didn't see the requirement of assigning the same ID to a string if it shows up twice. – user2357112 Sep 04 '13 at 05:03
  • Whoa, that's a neat use of `itertools.count` and `defaultdict`. Cool!! – nneonneo Sep 04 '13 at 05:08
  • You still haven't solved the problem. If the same string appears in two columns, it needs a different id. You have a solution (very neat, by the way), but it is not solving his problem. – Burhan Khalid Sep 04 '13 at 05:11
  • @user2357112 I don't understand this solution. Could you please explain? When I tried to execute this, for any value of `ids` I get `0`. – thefourtheye Sep 04 '13 at 05:14
  • @BurhanKhalid: Does it? I'm not sure. If so, you can make defaultdicts for each column. – user2357112 Sep 04 '13 at 05:51
  • @thefourtheye: Really? Odd. When you look up a key in a defaultdict, if it's not already present, the defaultdict calls a user-provided default value factory to get the value and stores it before returning it. `itertools.count()` creates an iterator that counts up from 0, so `itertools.count().next` is a bound method that produces a new integer whenever you call it. Combined, these tools produce a dict that returns a new integer whenever you look up something you've never looked up before. – user2357112 Sep 04 '13 at 06:01
  • @user2357112 That's awesome. Please add this explanation to your answer. I'll upvote :) – thefourtheye Sep 04 '13 at 07:16
  • @thefourtheye: Explanation added to the answer. – user2357112 Sep 04 '13 at 07:22
  • @AndyHayden: Yeah, for Python 3, it's `__next__` instead of `next`. – user2357112 Jun 02 '16 at 23:34
2

The defaultdict answer, updated for Python 3, where .next is now .__next__, and for pylint compliance, where calling "magic" __*__ methods directly is discouraged:

ids = collections.defaultdict(functools.partial(next, itertools.count()))
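
A quick check of this approach on Python 3, with made-up keys; the comments show what each lookup returns:

```python
import collections
import functools
import itertools

# next(counter) is called only when a key is missing, so each new key
# gets the next integer and repeated keys keep the ID they already have.
ids = collections.defaultdict(functools.partial(next, itertools.count()))

first = ids["apple"]   # new key
second = ids["pear"]   # new key
again = ids["apple"]   # already stored, no new ID is generated

print(first, second, again)  # 0 1 0
print(dict(ids))             # {'apple': 0, 'pear': 1}
```
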
Greg Allen
0

Create a set, and then add strings to the set. This will ensure that strings are not duplicated; then you can use enumerate to get a unique ID for each string. Use this ID when you are writing the file out again.

Here I am assuming the second column is the one you want to scan for text or integers.

import csv

seen = set()
with open('somefile.txt') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        try:
            int(row[1])
        except ValueError:
            seen.add(row[1])  # not an integer, so record the string

# print the unique ids for each string
for string_id, text in enumerate(seen):
    print("{}: {}".format(string_id, text))

Now you can take the same logic and replicate it across each column of your file. If you know the number of columns in advance, you can use a list of sets. Suppose the file has three columns:

import csv

unique_strings = [set(), set(), set()]

with open('file.txt') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        for column, value in enumerate(row):
            try:
                int(value)
            except ValueError:
                # It is not an integer, so it must be
                # a string
                unique_strings[column].add(value)
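
The sets collected this way still need to be turned into IDs. One possible way, sketched here with made-up data in place of the file, is to enumerate each column's set into a dict. Sorting first is an assumption added so the IDs come out the same on every run, since set iteration order for strings can vary between runs; `id_maps` is an illustrative name.

```python
# Stand-in for the list of sets built while scanning the file (made-up values).
unique_strings = [set(), {"red", "blue"}, {"x", "red"}]

# One {string: id} dict per column; IDs restart at 0 in each column.
id_maps = [
    {text: string_id for string_id, text in enumerate(sorted(strings))}
    for strings in unique_strings
]

print(id_maps)  # [{}, {'blue': 0, 'red': 1}, {'red': 0, 'x': 1}]
```

Looking up `id_maps[column][value]` then gives the ID to write out for a string in that column.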
Burhan Khalid