Add numeric column to pandas dataframe based on other textual column

Question

I have this dataframe:

import numpy as np
import pandas as pd
df = pd.DataFrame([['137', 'earn'], ['158', 'earn'], ['144', 'ship'], ['111', 'trade'], ['132', 'trade']], columns=['value', 'topic'])
print(df)
    value  topic
0   137   earn
1   158   earn
2   144   ship
3   111  trade
4   132  trade

And I want an additional numeric column like this:

    value  topic  topic_id
0   137   earn    0
1   158   earn    0
2   144   ship    1
3   111  trade    2
4   132  trade    2

So basically I want to generate a column which encodes a string column to a numeric value. I implemented this solution:

topics_dict = {}
topics = np.unique(df['topic']).tolist()
for i in range(len(topics)):
        topics_dict[topics[i]] = i
df['topic_id'] = [topics_dict[l] for l in df['topic']]

However, I am quite sure there is a more elegant, idiomatic pandas way to solve this, but I couldn't find anything on Google or SO. I read about pandas' get_dummies, but it creates a separate column for each distinct value in the original column.

I am thankful for any help or pointer in a direction!

cs95 · Answer 1 · 2017-11-01T10:03:05.810

2

Option 1
pd.factorize

df['topic_id'] = pd.factorize(df.topic)[0]
df

  value  topic  topic_id
0   137   earn         0
1   158   earn         0
2   144   ship         1
3   111  trade         2
4   132  trade         2

Option 2
np.unique

_, v = np.unique(df.topic, return_inverse=True)
df['topic_id'] = v

df

  value  topic  topic_id
0   137   earn         0
1   158   earn         0
2   144   ship         1
3   111  trade         2
4   132  trade         2

Option 3
pd.Categorical

df['topic_id'] = pd.Categorical(df.topic).codes
df

  value  topic  topic_id
0   137   earn         0
1   158   earn         0
2   144   ship         1
3   111  trade         2
4   132  trade         2

Option 4
DataFrameGroupBy.ngroup

df['topic_id'] = df.groupby('topic').ngroup()
df

  value  topic  topic_id
0   137   earn         0
1   158   earn         0
2   144   ship         1
3   111  trade         2
4   132  trade         2
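The four options give identical codes here only because the topics in the example already appear in alphabetical order. A minimal sketch of the difference, using a hypothetical frame `df2` (not part of the question) whose topics are out of order:

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose topics are NOT in alphabetical order
df2 = pd.DataFrame({'topic': ['trade', 'earn', 'trade', 'ship']})

# pd.factorize numbers the labels in order of first appearance
by_appearance = pd.factorize(df2['topic'])[0]

# np.unique, pd.Categorical and ngroup number the labels in sorted order
_, by_sorted = np.unique(df2['topic'], return_inverse=True)
cat_codes = pd.Categorical(df2['topic']).codes
group_ids = df2.groupby('topic').ngroup()

print(by_appearance.tolist())  # [0, 1, 0, 2]
print(by_sorted.tolist())      # [2, 0, 2, 1]
```

If sorted codes are wanted from factorize as well, `pd.factorize(df2['topic'], sort=True)` should give the same numbering as the other three.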

edited Nov 01 '17 at 10:03

answered Nov 01 '17 at 09:57

cs95

Very useful, thanks. Due to my missing reputation I can't upvote – tbeck Nov 01 '17 at 10:22
@T.Beck I thought you meant to accept mine first, since that's what you did :-) That's why I pointed out, when you accept someone's answer, then accept someone else, the other acceptance is undone. If you meant to tick that answer, that's fine. But if you meant to tick this, do realise that it got undone. – cs95 Nov 01 '17 at 10:23
Just got it now ;) – tbeck Nov 01 '17 at 10:36
`df.groupby('topic').ngroup()` is not working with python3. The error is: `AttributeError: 'DataFrameGroupBy' object has no attribute 'ngroup'` – rnso Jan 06 '18 at 01:18
@rnso Update to a newer version. – cs95 Dec 14 '18 at 03:58

score 1 · Accepted Answer · answered Nov 01 '17 at 09:56

You can use

In [63]: df['topic'].astype('category').cat.codes
Out[63]:
0    0
1    0
2    1
3    2
4    2
dtype: int8
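To store the codes as a column and keep a way back from each id to its label, something along these lines should work. This is only a sketch; the names `topic_cat` and `id_to_topic` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [['137', 'earn'], ['158', 'earn'], ['144', 'ship'], ['111', 'trade'], ['132', 'trade']],
    columns=['value', 'topic'],
)

# Convert once so the codes and the categories come from the same object
topic_cat = df['topic'].astype('category')
df['topic_id'] = topic_cat.cat.codes

# Position in .cat.categories is the code, so enumerate gives id -> label
id_to_topic = dict(enumerate(topic_cat.cat.categories))
print(id_to_topic)  # {0: 'earn', 1: 'ship', 2: 'trade'}
```

Note that missing values in the topic column get the code -1 with this approach.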

answered Nov 01 '17 at 09:56

Zero

stumbled upon Categories before but didn't think of simply converting it. Nice! – tbeck Nov 01 '17 at 10:17

karthik reddy · Answer 3 · 2017-11-01T10:16:17.027

0

We can use the apply function to create a new column based on an existing column, as shown below.

topic_list = list(df["topic"].unique())
df['topic_id'] = df.apply(lambda row: topic_list.index(row["topic"]), axis=1)

edited Nov 01 '17 at 10:16

answered Nov 01 '17 at 10:08

karthik reddy

score 0 · Answer 4 · answered Jan 06 '18 at 01:52

One can use for loops and a list comprehension to determine the list of codes:

ucols = pd.unique(df.topic)
df['topic_id'] = [ j
                for i in range(len(df.topic))
                for j in range(len(ucols))
                if df.topic[i] == ucols[j]  ]
print(df)

Output:

  value  topic  topic_id
0   137   earn         0
1   158   earn         0
2   144   ship         1
3   111  trade         2
4   132  trade         2
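The nested loops compare every row against every unique topic. A possible alternative that keeps the same order-of-appearance numbering is to build a lookup dictionary once and map the column through it. This is a sketch, and `topic_to_id` is just an illustrative name:

```python
import pandas as pd

df = pd.DataFrame(
    [['137', 'earn'], ['158', 'earn'], ['144', 'ship'], ['111', 'trade'], ['132', 'trade']],
    columns=['value', 'topic'],
)

# pd.unique keeps the order of first appearance, so ids follow that order
topic_to_id = {topic: i for i, topic in enumerate(pd.unique(df['topic']))}

# One dictionary lookup per row instead of a scan over all unique topics
df['topic_id'] = df['topic'].map(topic_to_id)
print(df)
```

The dictionary can also be reused later to encode new rows with the same ids.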

score -1 · Answer 5 · answered Nov 01 '17 at 10:01

Try this code

 df['topic_id'] = pd.Series([0,0,1,2,2], index=df.index)

It works well. The dataframe before and after:

   value  topic
0   137   earn
1   158   earn
2   144   ship
3   111  trade
4   132  trade

  value  topic  topic_id
0   137   earn         0
1   158   earn         0
2   144   ship         1
3   111  trade         2
4   132  trade         2

answered Nov 01 '17 at 10:01

FunnyCoder

Good luck with this if you have a million rows. – cs95 Nov 01 '17 at 10:02
We can modify the thing in [0,0,1,2,2] and it can be random series or any list. – FunnyCoder Nov 01 '17 at 10:12
