I have a data frame with a categorical column (TweetType) that has three categories (T, RT, and RE). I want to count how many times each category appears and then sum the counts. To do that, I created three new columns named T, RT, and RE.

def tweet_type(df):
    result = df.copy()
    result['T'] = result['tweetType'].str.contains("T")
    result['RT'] = result['tweetType'].str.contains("RT")
    result['RE'] = result['tweetType'].str.contains("RE")
    return result
tweet_type(my_df)

Then I converted the booleans into 0 and 1. The problem is that str.contains("T") also matches RT, because "RT" contains the letter T, so the result is not right.

What I obtain is:

TweetType  RT  T  RE
RT         1   1  0
RE         0   0  1
T          1   0  0
RT         1   1  0
Sunflower

2 Answers

str.contains does substring matching, so "T" also matches "RT". For exact matches, use eq instead, which returns a boolean Series:

def tweet_type(df):
    result = df.copy()
    result['T'] = result['Tweet_Type'].eq("T")
    result['RT'] = result['Tweet_Type'].eq("RT")
    result['RE'] = result['Tweet_Type'].eq("RE")
    return result
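
As a rough sketch of how this plays out, assuming a small made-up frame with a Tweet_Type column (the sample values are illustrative, not the asker's data), the exact-match flags can be cast to integers and summed to get a count per category:

```python
import pandas as pd

# Hypothetical sample data; the column name follows the snippet above.
df = pd.DataFrame({"Tweet_Type": ["RT", "RE", "T", "RT"]})

# eq() compares whole values, so "RT" is not flagged as "T".
flags = pd.DataFrame({
    "T": df["Tweet_Type"].eq("T"),
    "RT": df["Tweet_Type"].eq("RT"),
    "RE": df["Tweet_Type"].eq("RE"),
}).astype(int)  # booleans become 0/1

# Summing each 0/1 column gives the number of rows per category.
counts = flags.sum()
print(counts)
```

Casting with astype(int) is only needed if you want 0/1 columns; summing the boolean columns directly gives the same totals.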

However, there is an easier way to get what you're after: one-hot encode the column with get_dummies:

new_df = pd.get_dummies(df, columns=["Tweet_Type"])

If you don't want the prefix Tweet_Type_:

new_df = pd.get_dummies(df, columns=["Tweet_Type"], prefix='', prefix_sep='')

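If you wish to retain the original column, encode just that column and concatenate the result with df (concatenating the full get_dummies(df, ...) output would duplicate any other columns of df):

df = pd.concat([df, pd.get_dummies(df["Tweet_Type"])], axis=1)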
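
Putting the pieces together, here is a minimal sketch, again on made-up sample data, of encoding the column, keeping the original next to the indicators, and summing to get the counts. The variable names (dummies, out, type_counts, direct_counts) are just illustrative:

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({"Tweet_Type": ["RT", "RE", "T", "RT"]})

# One indicator column per category; astype(int) forces 0/1 values
# regardless of whether this pandas version returns bool or uint8.
dummies = pd.get_dummies(df["Tweet_Type"]).astype(int)

# Keep the original column alongside the indicator columns.
out = pd.concat([df, dummies], axis=1)

# Column sums give the number of rows per category.
type_counts = dummies.sum()

# If only the totals are needed, value_counts() gives them directly.
direct_counts = df["Tweet_Type"].value_counts()

print(out)
print(type_counts)
```

If the per-row 0/1 columns are not needed elsewhere, value_counts() alone is probably the shortest route to the totals.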
Shivam Roy

You can combine the two conditions with a logical AND (&) and a negation (~), so that rows containing "RT" are excluded from the "T" match:

(result['tweetType'].str.contains("T")) & (~result['tweetType'].str.contains("RT"))
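
For illustration, a small sketch of that expression applied to made-up data (the sample frame is an assumption, not the asker's data). Note that this only works as long as RT is the only other category containing a T:

```python
import pandas as pd

# Hypothetical sample data; the column name follows the question's code.
result = pd.DataFrame({"tweetType": ["RT", "RE", "T", "RT"]})

contains_t = result["tweetType"].str.contains("T")
contains_rt = result["tweetType"].str.contains("RT")

# "T" rows are those that contain T but do not contain RT.
result["T"] = (contains_t & ~contains_rt).astype(int)
result["RT"] = contains_rt.astype(int)

print(result)
```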
ThePyGuy