How to create new columns derived from a categorical column in Python?

Question

I have a data frame with a categorical column (TweetType) that has three categories (T, RT, and RE). I want to count how many times each category appears and then sum the counts. I created three new columns, named T, RT, and RE.

def tweet_type(df):
    result = df.copy()
    result['T'] = result['tweetType'].str.contains("T")
    result['RT'] = result['tweetType'].str.contains("RT")
    result['RE'] = result['tweetType'].str.contains("RE")
    return result
tweet_type(my_df)

Then I converted the booleans to 0 and 1. The problem is that str.contains("T") also matches RT, so the result is wrong.

What I obtain is:

TweetType  RT  T  RE
RT         1   1  0
RE         0   0  1
T          1   0  0
RT         1   1  0

Shivam Roy · Accepted Answer · 2021-05-15T10:46:15.157

Instead of str.contains, you should use eq, which tests for exact equality and returns booleans:

def tweet_type(df):
    result = df.copy()
    result['T'] = result['Tweet_Type'].eq("T")
    result['RT'] = result['Tweet_Type'].eq("RT")
    result['RE'] = result['Tweet_Type'].eq("RE")
    return result
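
To see the difference on a small sample, here is a minimal sketch. The frame sample_df and its values are made up for illustration; only the column name Tweet_Type follows the code above.

```python
import pandas as pd

# Hypothetical sample data; only the Tweet_Type column name comes from the answer.
sample_df = pd.DataFrame({"Tweet_Type": ["RT", "RE", "T", "RT"]})

# Substring match: "T" is also found inside "RT", so RT rows are flagged too.
contains_t = sample_df["Tweet_Type"].str.contains("T")

# Exact match: only rows whose value is exactly "T" are flagged.
equals_t = sample_df["Tweet_Type"].eq("T")

# Booleans convert to 0/1 with astype(int); summing gives the number of matches.
contains_count = int(contains_t.astype(int).sum())
equals_count = int(equals_t.astype(int).sum())
print(contains_count, equals_count)
```

On this sample the substring version over-counts T because both RT rows are included, while eq counts only the single exact T row.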

However, there is an easier way to achieve this: one-hot encode the column with get_dummies:

new_df = pd.get_dummies(df, columns=["Tweet_Type"])

If you don't want the prefix Tweet_Type_:

new_df = pd.get_dummies(df, columns=["Tweet_Type"], prefix='', prefix_sep='')

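If you wish to retain the original Tweet_Type column, encode only that column and concatenate the result with df (concatenating df with new_df would duplicate every other column):

df = pd.concat([df, pd.get_dummies(df["Tweet_Type"])], axis=1)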
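
Since the question is ultimately about counting each category, the indicator columns can simply be summed. The following is a sketch under assumptions: sample_df, its user column, and the sample values are hypothetical, and dtype=int is passed so the indicators are 0/1 integers regardless of the pandas version's default.

```python
import pandas as pd

# Hypothetical sample data; "user" stands in for any other columns in the frame.
sample_df = pd.DataFrame({
    "user": ["a", "b", "c", "d"],
    "Tweet_Type": ["RT", "RE", "T", "RT"],
})

# One 0/1 indicator column per category, named after the category itself.
dummies = pd.get_dummies(sample_df["Tweet_Type"], dtype=int)

# Keep the original column and append the indicator columns next to it.
encoded = pd.concat([sample_df, dummies], axis=1)

# Column sums of the indicators are the per-category counts.
counts = dummies.sum()

# value_counts gives the same counts without creating any new columns.
direct_counts = sample_df["Tweet_Type"].value_counts()

print(encoded)
print(counts)
```

If only the totals are needed, value_counts alone may be enough; the indicator columns are mainly useful when they feed into later per-row processing.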

score 0 · Answer 2 · answered May 15 '21 at 10:26

You can use a logical AND to exclude the records that contain RT:

(result['tweetType'].str.contains("T")) & (~result['tweetType'].str.contains("RT"))
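
A minimal sketch of this mask on made-up data, assuming a frame named result with a tweetType column as in the question (the sample values are hypothetical). It relies on RT being the only other category that contains the letter T.

```python
import pandas as pd

# Hypothetical sample data using the question's column name.
result = pd.DataFrame({"tweetType": ["RT", "RE", "T", "RT"]})

# Rows that contain "T" but do not contain "RT".
is_t = result["tweetType"].str.contains("T") & ~result["tweetType"].str.contains("RT")

# Store the flag as 0/1 and sum it to count the plain T rows.
result["T"] = is_t.astype(int)
t_count = int(result["T"].sum())
print(result)
print(t_count)
```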

ThePyGuy
