import pandas as pd

d = {'col': ['ana', 'ben', 'carl', 'dennis', 'earl', ...]}
df = pd.DataFrame(data=d)

Above is an example dataframe. As far as I know, one-hot encoding (OHE) is usually not used when a column has more than 5 unique values (correct me if I'm wrong).

Instead, the values are mapped to integers using a dictionary.

An example dictionary would be:

mapping = {'ana': 1, 'ben': 2, 'carl': 3, ...}

Is there a library, or any other way, to build this mapping automatically (although manual mapping may be better, since you know which value is mapped to which number)?

EDIT 1

Using ascii_lowercase, I am able to map single-letter strings to integers. But, as shown above, what if my strings are not single letters?

1 Answer

original question

You can generate the dictionary programmatically using ascii_lowercase and enumerate in a dictionary comprehension:

from string import ascii_lowercase

dic = {k:v for v,k in enumerate(ascii_lowercase, start=1)}

Output:

{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8, 'i': 9, 'j': 10, 'k': 11, 'l': 12, 'm': 13, 'n': 14, 'o': 15, 'p': 16, 'q': 17, 'r': 18, 's': 19, 't': 20, 'u': 21, 'v': 22, 'w': 23, 'x': 24, 'y': 25, 'z': 26}

Then you can just map:

df['col'].map(dic)
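As a minimal sketch of what the mapping step does, assuming a small hypothetical column (`df_letters` and its values are illustrative, not from the question): `Series.map` looks up each value in the dictionary, and any value that is not a key in the dictionary becomes NaN.

```python
from string import ascii_lowercase

import pandas as pd

# Hypothetical data for illustration only: three single letters and one name
df_letters = pd.DataFrame({'col': ['a', 'c', 'z', 'ana']})

# Letter -> position in the alphabet, starting at 1
dic = {k: v for v, k in enumerate(ascii_lowercase, start=1)}

# Each value is looked up in dic; values absent from dic (here 'ana') become NaN
mapped = df_letters['col'].map(dic)
print(mapped)
```

Because of the NaN, the resulting Series is stored as floats rather than integers, which is one reason a letter-based dictionary is a poor fit for multi-character strings.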

edit: dictionary from an arbitrary Series of values

You can use pandas.factorize:

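# codes: one integer per row (starting at 0); uniques: distinct values in order of first appearance
codes, uniques = pd.factorize(df['col'])
# number the unique values from 1, so the mapping stays correct when the column has repeated values
dic = {k: i for i, k in enumerate(uniques, start=1)}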

Output: {'ana': 1, 'ben': 2, 'carl': 3, 'dennis': 4, 'earl': 5}
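For completeness, here is a hedged end-to-end sketch of the `pandas.factorize` approach. It assumes only the five names visible in the question plus one repeated value (the real column may hold more), and the `col_encoded` column name is made up for illustration. The dictionary is built from the unique values alone, so repeated rows do not affect the numbering.

```python
import pandas as pd

# Assumption: the five names shown in the question, with 'ben' repeated on purpose
df = pd.DataFrame({'col': ['ana', 'ben', 'carl', 'dennis', 'earl', 'ben']})

# codes: one integer per row, starting at 0
# uniques: the distinct values, in order of first appearance
codes, uniques = pd.factorize(df['col'])

# Number the unique values from 1 to build the value -> integer dictionary
dic = {k: i for i, k in enumerate(uniques, start=1)}

# Apply the dictionary to the column (hypothetical output column name)
df['col_encoded'] = df['col'].map(dic)

print(dic)
print(df)
```

If the dictionary itself is not needed, `codes + 1` already holds the encoded column, so the `map` step can be skipped.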

mozway
  • Does it go on infinitely? Like, let's say I have 20 values, does it stop at 20? And what if my values aren't a, b, c, but names instead? – Wee Liang Kelven Lim Dec 09 '22 at 07:25
  • @Wee it goes up to 26, as there are 26 letters; if you use `ascii_letters` you'll get 52 letters – mozway Dec 09 '22 at 07:26
  • @Wee but it doesn't matter if the dictionary is longer than needed; only the necessary keys will be used. If you really want to limit it: `dic = {k:v for v,k in enumerate(ascii_lowercase[:20], start=1)}` – mozway Dec 09 '22 at 07:31