Is there a way to map strings to integers automatically?

Question

d = {'col': ['ana', 'ben', 'carl', 'dennis', 'earl', ...]}
df = pd.DataFrame(data = d)

I have an example dataframe here. Usually, if there are more than 5 unique values, one-hot encoding (OHE) will not be used (correct me if I'm wrong).

Instead, mapping using a dictionary is used.

An example dictionary would be

dict = {'ana': 1, 'ben': 2, 'carl': 3, ...}

Is there a library or any way to make this automatic (though manual mapping may be better as you know which values are mapped to which number)?

EDIT 1

Using ascii_lowercase, I am able to map single letter strings to integers. But as shown above, what if my strings are not single letters?

@jezrael not really the correct duplicate, I feel that your standards are lower when you are closing ;) — mozway, Dec 09 '22 at 07:25
@mozway - I think [this](https://stackoverflow.com/a/51466185/2901002) is 100% dupe - `{chr(i+96):i for i in range(1,27)}`. — jezrael, Dec 09 '22 at 07:28
@jezrael but not answering the question that was asked, anyway, just pointing out your difference of standards ;) — mozway, Dec 09 '22 at 07:31
@Wee no we're not, but now the question is completely different, you need to use [`pandas.factorize`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.factorize.html) — mozway, Dec 09 '22 at 07:33
@Wee what is your end goal? To generate a dictionary? or to map the values as a new column? — mozway, Dec 09 '22 at 07:37

mozway · Answer 1 · 2022-12-09T07:40:11.900

original question

You can generate the dictionary programmatically using ascii_lowercase and enumerate in a dictionary comprehension:

from string import ascii_lowercase

dic = {k:v for v,k in enumerate(ascii_lowercase, start=1)}

Output:

{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8, 'i': 9, 'j': 10, 'k': 11, 'l': 12, 'm': 13, 'n': 14, 'o': 15, 'p': 16, 'q': 17, 'r': 18, 's': 19, 't': 20, 'u': 21, 'v': 22, 'w': 23, 'x': 24, 'y': 25, 'z': 26}

Then you can just map:

df['col'].map(dic)
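As a minimal sketch of what the mapping does, assuming a hypothetical Series `s` that holds both single letters and a longer name: keys found in the dictionary are replaced by their number, and anything not in the dictionary comes back as NaN.

```python
from string import ascii_lowercase

import pandas as pd

# letter -> position in the alphabet, starting at 1
dic = {k: v for v, k in enumerate(ascii_lowercase, start=1)}

# hypothetical example data: three single letters and one longer string
s = pd.Series(['a', 'c', 'z', 'ana'])

# 'ana' is not a key of dic, so it maps to NaN and the result is float
mapped = s.map(dic)
print(mapped)
```

This is why the single-letter dictionary is not enough for names like 'ana' or 'ben'.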

edit: dictionary from an arbitrary Series of values

You can use pandas.factorize:

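# codes: one integer label per row; uniques: distinct values in order of first appearance
codes, uniques = pd.factorize(df['col'])
# number the unique values from 1 (pairing uniques with codes breaks when rows repeat)
dic = {k: i for i, k in enumerate(uniques, start=1)}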

Output: {'ana': 1, 'ben': 2, 'carl': 3, 'dennis': 4, 'earl': 5}
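If the goal is a new column rather than a dictionary, the codes returned by pandas.factorize can be used directly. A minimal sketch, assuming a hypothetical column with repeated names and a made-up column name `col_id`:

```python
import pandas as pd

# hypothetical data with repeated values
df = pd.DataFrame({'col': ['ana', 'ben', 'carl', 'ben', 'ana']})

# factorize returns (codes, uniques); codes start at 0, so shift by 1
codes, uniques = pd.factorize(df['col'])
df['col_id'] = codes + 1

# a lookup dictionary built from the unique values only
mapping = {name: i for i, name in enumerate(uniques, start=1)}
print(df)
print(mapping)
```

Rows that share a name get the same integer, and the numbering follows the order in which each name first appears.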

edited Dec 09 '22 at 07:40

answered Dec 09 '22 at 07:21


Does it go on infinitely? Like, let's say I have 20 values, does it stop at 20? And what if my values aren't a, b, c, but names instead? – Wee Liang Kelven Lim Dec 09 '22 at 07:25
@Wee it goes until 26 as there are 26 letters, if you use `ascii_letters` you'll get 52 letters – mozway Dec 09 '22 at 07:26
@Wee but you don't care if the dictionary is longer than needed, only the necessary values will be used. If you really want to limit: `dic = {k:v for v,k in enumerate(ascii_lowercase[:20], start=1)}` – mozway Dec 09 '22 at 07:31
