5

I have a list of strings, some of which are repeated, e.g. (not the actual list):

["hello", "goodbye", "hi", "how are you", "hi"]

I want to create a list of integers where each integer corresponds to a string. e.g. for the example above

[0, 1, 2, 3, 2]

where 0 = "hello", 1 = "goodbye" etc.

I looked at the example here: Convert a list of integer to a list of predefined strings in Python

I want to do basically the same thing but the other way around, strings to integers. That part shouldn't be too hard.

However, they seem to just create the dictionary in their code like this:

trans = {0: 'abc', 1: 'f', 2: 'z'}

Creating the dictionary yourself is fine when you know the exact contents of your list. My list of strings is extremely long and I don't know what the strings are, as it comes from input. So I'd need to make the dictionary from my list of strings some other way, like maybe a for loop.

I can't figure out how to make a dictionary that will map the strings in my list to numbers. I looked up how to make a dictionary with list comprehensions but I couldn't figure out how it deals with duplicates.

In other words, I'd like to know how to go through a list like my list of strings above and create a dictionary like:

{"hello": 0, "goodbye": 1, "hi": 2, "how are you": 3}

EDIT: I've had a lot of answers, thanks everyone for all your help. What I am now confused about is all the different ways of doing this. There have been a lot of suggestions, using enumerate(), set() and other functions. There was also one answer (@ChristianIacobs) that did it very simply with just a for loop. What I am wondering is whether there is any reason to use one of the slightly less simple answers? For instance, are they faster, or are there some situations where they are the only way that works?

IceWarrior42
  • 63
  • 1
  • 2
  • 5
  • `dict(enumerate(words))`? Or `{word: index for index, word in enumerate(words)}` for the reverse. That would give you the *last* index of each word. – jonrsharpe Jun 06 '19 at 09:59
  • So do you want duplicates to just be ignored, then? – MegaEmailman Jun 06 '19 at 09:59
  • `dict(zip(list_of_digits,list_of_strings))`? – yatu Jun 06 '19 at 10:00
  • @jonrsharpe, I'm not necessarily concerned about them being indices. I was basically wanting each unique string to have a unique integer so that the strings could be replaced with integers that correspond to them. – IceWarrior42 Jun 06 '19 at 10:03
  • @MegaEmailman, I'm just trying to make a dictionary that identifies each unique string with a unique integer. So the dictionary shouldn't have any duplicates in it. Then I can go through the list and make a new list that replaces each string with its number equivalent. – IceWarrior42 Jun 06 '19 at 10:05
  • @yatu I'm not quite sure what that does. I don't have a list of digits, my end goal is to create a list of integers (they won't all be one digit as my list of strings is long), but I need to create a dictionary to map the numbers to strings. – IceWarrior42 Jun 06 '19 at 10:06
  • @IceWarrior42 `dict(enumerate(set(l)))`? – yatu Jun 06 '19 at 10:07
  • Christian Iacob has posted a solution that seems very simple and I can't believe I didn't think of it--I've tested it and it seems to work. What is the difference/is there a reason why it might be better to use `enumerate()` or `zip` instead? Are they faster or something? – IceWarrior42 Jun 06 '19 at 10:16
  • I'm not sure whether or not to accept Christian Iacob's answer. It's very clear, but seems almost too good to be true since it's so simple. Do the other answers have any advantage? – IceWarrior42 Jun 06 '19 at 11:02
  • "In other words, I'd like to know how to go through a list like my list of strings above and create a dictionary like:" - This is not well defined. **Why should** the value for `"hello"` be `0`, and not some other integer? – Karl Knechtel Apr 04 '23 at 05:47

9 Answers

4

To create a dictionary from your list you first need to get rid of duplicate values. Use a set to achieve that:

my_list = ["hello", "goodbye", "hi", "how are you", "hi"]
unique_list = list(set(my_list))

['hi', 'hello', 'goodbye', 'how are you']

Now you can create your dictionary by zipping the unique_list with a range of numbers:

my_dict = dict(zip(unique_list, range(len(unique_list))))

{'hi': 0, 'hello': 1, 'goodbye': 2, 'how are you': 3}
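
One caveat with the set approach: sets are unordered, so which word gets which number can differ between runs. If numbering by first appearance matters, a minimal sketch using `dict.fromkeys` might look like the following. It assumes Python 3.7+, where dictionaries keep insertion order, and reuses the variable names from above:

```python
my_list = ["hello", "goodbye", "hi", "how are you", "hi"]

# dict.fromkeys drops duplicates while keeping first-seen order (Python 3.7+)
unique_list = list(dict.fromkeys(my_list))

# Pair each unique word with its position in unique_list
my_dict = {word: index for index, word in enumerate(unique_list)}

print(my_dict)
# {'hello': 0, 'goodbye': 1, 'hi': 2, 'how are you': 3}
```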
Peter
  • 10,959
  • 2
  • 30
  • 47
2

Try this:

>>> w = ["hello", "goodbye", "hi", "how are you", "hi"]
>>> l = [0, 1, 2, 3, 2]
>>> trans = {l1:w1 for w1,l1 in zip(w,l)}
>>> trans
{0: 'hello', 1: 'goodbye', 2: 'hi', 3: 'how are you'}
1
words = ["hello", "goodbye", "hi", "how are you", "hi"]

d = dict()
i = 0
for word in words:
    if word not in d:
        d[word] = i
        i += 1
print(d)
# print(sorted(d.items(), key=lambda kv: kv[1]))  # uncomment to print the pairs sorted by their number
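
To get from this dictionary to the list of integers asked for in the question, one more step is needed. The sketch below is a compact variant of the same loop, not a replacement for it: `setdefault` and the name `encoded` are illustrative choices, and it relies on `len(d)` being the next unused integer.

```python
words = ["hello", "goodbye", "hi", "how are you", "hi"]

d = {}
for word in words:
    # setdefault only stores a value when the word is new;
    # len(d) at that moment is the next unused integer
    d.setdefault(word, len(d))

# Replace each string with its number
encoded = [d[word] for word in words]

print(d)        # {'hello': 0, 'goodbye': 1, 'hi': 2, 'how are you': 3}
print(encoded)  # [0, 1, 2, 3, 2]
```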
1

The answer is very simple. You can do it in just two lines.

The code is:

l = ['hello', 'goodbye', 'hi', 'how are you', 'hi']
{a: b for b,a in enumerate(l)}

Here enumerate yields (index, value) tuples, which the dict comprehension turns into value: index pairs. Note that a repeated string ends up with the index of its last occurrence.
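
To see how this handles the duplicate, here is the same comprehension run on the example list; the names `mapping` and `encoded` are added only for illustration. Each string still gets a unique integer, but the integers are list positions, so they are not consecutive when there are repeats:

```python
l = ['hello', 'goodbye', 'hi', 'how are you', 'hi']

# Later occurrences overwrite earlier ones, so 'hi' maps to 4, not 2
mapping = {a: b for b, a in enumerate(l)}
encoded = [mapping[a] for a in l]

print(mapping)  # {'hello': 0, 'goodbye': 1, 'hi': 4, 'how are you': 3}
print(encoded)  # [0, 1, 4, 3, 4]
```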

0

You can do it by these steps:

  • get rid of duplicate words, by using set
  • map unique words to a unique number (array index), by using enumerate
  • loop over words to get their assigned number

You can get output of the expected form with the snippet below (the exact numbers depend on the set's ordering).

words = ["hello", "goodbye", "hi", "how are you", "hi"]
unique_words = set(words)
words_map = {word: i for i, word in enumerate(unique_words)}

result = [words_map[word] for word in words]
print(result)
mrzrm
  • 926
  • 7
  • 19
0

@jonrsharpe, I'm not necessarily concerned about them being indices. I was basically wanting each unique string to have a unique integer so that the strings could be replaced with integers that correspond to them.

Then the process is as follows:

  • Determine the set of keys we need (each distinct item in the original list).

  • Assign each a value - the easiest way is to make a list of that set again (since by definition, the elements are now unique) and use the index of the elements in that list. To build that mapping, we can use a trick with enumerate along the lines of what @jonrsharpe already proposed.

  • Translate the original list through the mapping.

Thus:

keys = list(set(original))
mapping = {k:v for v,k in enumerate(keys)}
result = [mapping[k] for k in original]
Karl Knechtel
  • 62,466
  • 11
  • 102
  • 153
  • Actually, `enumerate` can be used directly on the `set(original)`, but this is - I think - clearer for pedagogical purposes. – Karl Knechtel Jun 06 '19 at 10:12
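
A sketch of the shorter form that comment refers to. The call to `sorted` is an addition of mine, not part of the answer: it fixes the order so the numbering is the same on every run, at the cost of no longer following first appearance.

```python
original = ["hello", "goodbye", "hi", "how are you", "hi"]

# sorted() is an added assumption: it makes the numbering reproducible
mapping = {k: v for v, k in enumerate(sorted(set(original)))}
result = [mapping[k] for k in original]

print(mapping)  # {'goodbye': 0, 'hello': 1, 'hi': 2, 'how are you': 3}
print(result)   # [1, 0, 2, 3, 2]
```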
0

Here's my idea. It will be explained in comments. Assume you have a file containing nothing but the words.

import re         # Import the re module
phrases = {}      # Create a dictionary
with open("/path/to/file", "r") as file:    # Open the file containing all your phrases (closed automatically)
    data = file.read()    # Read the file
cleaned_data = re.split(r"\s+", data.strip())    # Split on runs of whitespace (spaces, tabs, newlines)
for word in cleaned_data:
    if word and word not in phrases:      # Skip empty strings and words already in your dictionary
        phrases[word] = len(phrases) + 1    # Add the new word as a key; values start at 1 and increase automatically
MegaEmailman
  • 505
  • 3
  • 11
0

You can try something like the following:

vocab_dict = {word: index for index, word in enumerate(list(set(words)))}

Given the words list from the example in the question, the contents of vocab_dict would look something like this (the exact numbering may vary, because sets are unordered):

>>> vocab_dict
{'how are you': 0, 'hello': 1, 'goodbye': 2, 'hi': 3}

0
Simple program using the map function to create a dict:

list1 = ["hello", "goodbye", "hi", "how are you", "hi"]
leng = list(range(len(list1)))
integ_map = map(lambda key, val: (key, val), list1, leng)
print(dict(integ_map))
manjunath
  • 37
  • 4