1

I have this dataframe:

import numpy as np
import pandas as pd
df = pd.DataFrame([['137', 'earn'], ['158', 'earn'],['144', 'ship'],['111', 'trade'],['132', 'trade']], columns=['value', 'topic'] )
print(df)
    value  topic
0   137   earn
1   158   earn
2   144   ship
3   111  trade
4   132  trade

And I want an additional numeric column like this:

    value  topic  topic_id
0   137   earn    0
1   158   earn    0
2   144   ship    1
3   111  trade    2
4   132  trade    2

So basically I want to generate a column which encodes a string column to a numeric value. I implemented this solution:

topics_dict = {}
topics = np.unique(df['topic']).tolist()
for i in range(len(topics)):
        topics_dict[topics[i]] = i
df['topic_id'] = [topics_dict[l] for l in df['topic']]

However, I am quite sure there is a more elegant, more idiomatic pandas way to solve this, but I couldn't find anything on Google or SO. I read about pandas' get_dummies, but that creates a separate column for each distinct value in the original column.

I would be thankful for any help or a pointer in the right direction!

tbeck

5 Answers

2

Option 1
pd.factorize

df['topic_id'] = pd.factorize(df.topic)[0]
df

  value  topic  topic_id
0   137   earn         0
1   158   earn         0
2   144   ship         1
3   111  trade         2
4   132  trade         2

Option 2
np.unique

_, v = np.unique(df.topic, return_inverse=True)
df['topic_id'] = v

df

  value  topic  topic_id
0   137   earn         0
1   158   earn         0
2   144   ship         1
3   111  trade         2
4   132  trade         2

Option 3
pd.Categorical

df['topic_id'] = pd.Categorical(df.topic).codes
df

  value  topic  topic_id
0   137   earn         0
1   158   earn         0
2   144   ship         1
3   111  trade         2
4   132  trade         2

Option 4
DataFrameGroupBy.ngroup

df['topic_id'] = df.groupby('topic').ngroup()
df

  value  topic  topic_id
0   137   earn         0
1   158   earn         0
2   144   ship         1
3   111  trade         2
4   132  trade         2
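
One difference between the four options does not show up with the question's data, because the topics there already appear in alphabetical order. A minimal sketch, assuming a hypothetical frame in which 'trade' comes first: pd.factorize numbers the values in order of first appearance by default, while np.unique, pd.Categorical and groupby(...).ngroup() number them in sorted order.

```python
import numpy as np
import pandas as pd

# Hypothetical data (not from the question): 'trade' appears before 'earn',
# so order of appearance differs from sorted order.
df = pd.DataFrame({'topic': ['trade', 'earn', 'ship', 'earn']})

# Option 1: codes follow order of first appearance (sort=False is the default)
codes_factorize = pd.factorize(df['topic'])[0]

# Option 1 with sort=True: codes follow sorted order of the unique values
codes_factorize_sorted = pd.factorize(df['topic'], sort=True)[0]

# Options 2 to 4: codes follow sorted order of the unique values
codes_unique = np.unique(df['topic'].to_numpy(), return_inverse=True)[1]
codes_categorical = pd.Categorical(df['topic']).codes
codes_ngroup = df.groupby('topic').ngroup().to_numpy()

print(codes_factorize.tolist())         # [0, 1, 2, 1]
print(codes_factorize_sorted.tolist())  # [2, 0, 1, 0]
print(codes_unique.tolist())            # [2, 0, 1, 0]
print(codes_categorical.tolist())       # [2, 0, 1, 0]
print(codes_ngroup.tolist())            # [2, 0, 1, 0]
```

If the ids have to be the same across all options, passing sort=True to pd.factorize is probably the simplest way to line them up.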
cs95
  • Very useful, thanks. Due to my missing reputation I can't upvote – tbeck Nov 01 '17 at 10:22
  • @T.Beck I thought you meant to accept mine first, since that's what you did :-) That's why I pointed out, when you accept someone's answer, then accept someone else, the other acceptance is undone. If you meant to tick that answer, that's fine. But if you meant to tick this, do realise that it got undone. – cs95 Nov 01 '17 at 10:23
  • Just got it now ;) – tbeck Nov 01 '17 at 10:36
  • `df.groupby('topic').ngroup()` is not working with python3. The error is: `AttributeError: 'DataFrameGroupBy' object has no attribute 'ngroup'` – rnso Jan 06 '18 at 01:18
  • @rnso Update to a newer version. – cs95 Dec 14 '18 at 03:58
1

You can use

In [63]: df['topic'].astype('category').cat.codes
Out[63]:
0    0
1    0
2    1
3    2
4    2
dtype: int8
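
To store the codes as the topic_id column from the question and keep a way back from id to label, something along these lines should work. This is a sketch only; the names topic_cat and id_to_topic are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [['137', 'earn'], ['158', 'earn'], ['144', 'ship'], ['111', 'trade'], ['132', 'trade']],
    columns=['value', 'topic'],
)

# Convert once so the codes and the categories come from the same object
topic_cat = df['topic'].astype('category')
df['topic_id'] = topic_cat.cat.codes

# Illustrative lookup table: code -> original label
id_to_topic = dict(enumerate(topic_cat.cat.categories))

print(df['topic_id'].tolist())  # [0, 0, 1, 2, 2]
print(id_to_topic)              # {0: 'earn', 1: 'ship', 2: 'trade'}
```

Note that missing values in the topic column get the code -1 with this approach.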
Zero
0

We can use the apply function to create a new column based on an existing column, as shown below.

topic_list = list(df["topic"].unique())
df['topic_id'] = df.apply(lambda row: topic_list.index(row["topic"]), axis=1)
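
The apply approach above calls list.index once per row, which scans the list each time. A possible variant of the same idea, offered as a sketch rather than as this answer's method, builds a dictionary once and uses Series.map; the name topic_to_id is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [['137', 'earn'], ['158', 'earn'], ['144', 'ship'], ['111', 'trade'], ['132', 'trade']],
    columns=['value', 'topic'],
)

# unique() keeps the order of first appearance, so the ids match the
# list-based version: the first topic seen gets 0, the next new one 1, ...
topic_to_id = {topic: i for i, topic in enumerate(df['topic'].unique())}

# Look up each topic in the dictionary instead of scanning a list per row
df['topic_id'] = df['topic'].map(topic_to_id)

print(df['topic_id'].tolist())  # [0, 0, 1, 2, 2]
```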

karthik reddy
0

One can use nested for loops in a list comprehension to build the list of codes:

ucols = pd.unique(df.topic)
df['topic_id'] = [ j
                for i in range(len(df.topic))
                for j in range(len(ucols))
                if df.topic[i] == ucols[j]  ]
print(df)

Output:

  value  topic  topic_id
0   137   earn         0
1   158   earn         0
2   144   ship         1
3   111  trade         2
4   132  trade         2
rnso
-1

Try this code

 df['topic_id'] = pd.Series([0,0,1,2,2], index=df.index)

It works well. The dataframe before and after:

   value  topic
0   137   earn
1   158   earn
2   144   ship
3   111  trade
4   132  trade
  value  topic  topic_id
0   137   earn         0
1   158   earn         0
2   144   ship         1
3   111  trade         2
4   132  trade         2
FunnyCoder