
I am looking for sentiment analysis code with at least 80% accuracy. I tried VADER and found it easy to use, but it was only giving about 64% accuracy.

Now I am looking at some BERT models, and I noticed they need to be re-trained? Is that correct? Isn't BERT pre-trained? Is re-training necessary?

Shrumo
  • I doubt you'll have the time or resources to re-train a BERT model from scratch. What you need is to use the weights of a pre-trained BERT and then train the model on your own data, not starting from scratch (see the fine-tuning sketch after these comments). – NotAName Oct 21 '21 at 05:16
  • Okay, so ultimately training the model is needed on our end? Because with VADER none of those steps are needed; it's a ready-to-use sort of model, but then the accuracy is pretty low. – Shrumo Oct 21 '21 at 05:27
  • Yes, training may be needed to adjust the pre-trained model to your own data. It depends on how well it works with your data. – NotAName Oct 21 '21 at 05:29
  • Okay, got it! Thanks! – Shrumo Oct 21 '21 at 05:35
  • There are state-of-the-art models on HuggingFace that are extraordinarily good. For example, one with 26 emotions (https://huggingface.co/arpanghoshal/EmoRoBERTa) and a simpler one (https://huggingface.co/finiteautomata/bertweet-base-sentiment-analysis) – Prayson W. Daniel Oct 21 '21 at 05:41
  • @PraysonW.Daniel Is RoBERTa exclusive to analyzing emotions (angry, sad, happy, etc.) only, or is it usable for arbitrary strings of text as well? – Shrumo Oct 21 '21 at 05:47
  • Yes, it can be used for any text – Prayson W. Daniel Oct 21 '21 at 14:32
  • @PraysonW.Daniel Okay, thanks! – Shrumo Oct 26 '21 at 04:39
  • @PraysonW.Daniel Regarding the links you shared, is there any way to use EmoRoBERTa on a pandas column? The code has a 'pipeline' function, which I am guessing is for single texts. I have been trying to apply it to a pandas DataFrame but it's not working so far. Any ideas? – Shrumo Oct 27 '21 at 04:40
  • Yes, I can write a demo script as an answer. What would you like as a result: just the emotion with the highest score, or the top N emotions? – Prayson W. Daniel Oct 27 '21 at 05:12
  • @PraysonW.Daniel I want the script to give me an additional column in the pandas DataFrame labelled "Emotions", which should give the emotion for each row. Is something like that possible? I was trying this code to convert the pandas column to a string and then apply the pipeline, but it didn't work: `Data['Translated'] = str(Data['Translated']); Emotion_labels = Data['Translated']; Emotion = emotion(Emotion_labels); print(Emotion)` – Shrumo Oct 27 '21 at 05:59
  • Don't train, go with fine-tuning. Look for it in the Hugging Face library (sorry, I can't post links, I'm on the phone) – SilentCloud Oct 27 '21 at 07:46
  • @PraysonW.Daniel Hi, for the Hugging Face model you suggested (https://huggingface.co/finiteautomata/bertweet-base-sentiment-analysis), how can I find the output label for this model? I wanted to try it on my pandas column, but I am unable to apply it. Could you please help me locate what object I should call for this? – Shrumo Jan 18 '22 at 13:35
  • ```python
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("finiteautomata/bertweet-base-sentiment-analysis")
    model = AutoModelForSequenceClassification.from_pretrained("finiteautomata/bertweet-base-sentiment-analysis")
    ``` – Prayson W. Daniel Jan 18 '22 at 17:35
  • @PraysonW.Daniel Thanks! It worked. Also, on HuggingFace, how can we identify the pipeline if nothing is mentioned on the model card? – Shrumo Jan 19 '22 at 09:05
  • I usually just browse HuggingFace searching for keywords :) – Prayson W. Daniel Jan 19 '22 at 13:59
  • @PraysonW.Daniel Oh, okay! Thanks :) – Shrumo Jan 20 '22 at 11:21
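
Below is a minimal sketch of the fine-tuning workflow suggested in the comments (start from pre-trained weights, then train on your own labelled data), assuming the Hugging Face `transformers` and `datasets` libraries. The `bert-base-uncased` checkpoint, the toy DataFrame, and the `text`/`label` column names are illustrative assumptions, not something from this thread.

```python
# Hedged sketch: fine-tune a pre-trained checkpoint on your own labelled data
# instead of training BERT from scratch. Checkpoint, toy data, and column
# names are illustrative assumptions.
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "bert-base-uncased"  # any suitable pre-trained checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# toy labelled data (1 = positive, 0 = negative) -- replace with your own reviews
df = pd.DataFrame({
    "text": ["I love this dress, it fits perfectly", "Terrible quality, returned it"],
    "label": [1, 0],
})

# tokenize the texts so the Trainer can feed them to the model
dataset = Dataset.from_pandas(df).map(
    lambda row: tokenizer(row["text"], truncation=True,
                          padding="max_length", max_length=128)
)

args = TrainingArguments(output_dir="bert-finetuned-sentiment",
                         num_train_epochs=3,
                         per_device_train_batch_size=8)

trainer = Trainer(model=model, args=args, train_dataset=dataset)
trainer.train()
trainer.save_model("bert-finetuned-sentiment")  # reload later with from_pretrained
```

After training, the saved weights can be reloaded with `from_pretrained` and wrapped in a `pipeline`, just like the ready-made models in the answers below.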

2 Answers


You can use pre-trained models from HuggingFace. There are plenty to choose from; search for emotion or sentiment models.

Here is an example of a model with 26 emotions. The current implementation works but is very slow for large datasets.

import pandas as pd
from transformers import RobertaTokenizerFast, TFRobertaForSequenceClassification, pipeline

# load the pre-trained EmoRoBERTa tokenizer and model once
tokenizer = RobertaTokenizerFast.from_pretrained("arpanghoshal/EmoRoBERTa")
model = TFRobertaForSequenceClassification.from_pretrained("arpanghoshal/EmoRoBERTa")

# build the inference pipeline from the already-loaded model and tokenizer
emotion = pipeline('sentiment-analysis',
                   model=model,
                   tokenizer=tokenizer)

# example data
DATA_URI = "https://github.com/AFAgarap/ecommerce-reviews-analysis/raw/master/Womens%20Clothing%20E-Commerce%20Reviews.csv"
dataf = pd.read_csv(DATA_URI, usecols=["Review Text"])

# This is super slow, I will find a better optimization ASAP
dataf = (dataf
         .head(50)  # comment this out for the whole dataset
         .assign(Emotion=lambda d: (d["Review Text"]
                                    .fillna("")
                                    .map(lambda x: emotion(x)[0].get("label", None))))
)

We could also refactor it a bit

...
# a bit faster than the previous version but still slow

def emotion_func(text: str) -> str:
    # return the top emotion label for a piece of text, or None for empty text
    if not text:
        return None
    return emotion(text)[0].get("label", None)


dataf = (dataf
         .head(50)  # comment this out for the whole dataset
         .assign(Emotion=lambda d: (d["Review Text"]
                                    .fillna("")
                                    .map(emotion_func)))
)

Results:

    Review Text Emotion
0   Absolutely wonderful - silky and sexy and comf...   admiration
1   Love this dress! it's sooo pretty. i happene... love
2   I had such high hopes for this dress and reall...   fear
3   I love, love, love this jumpsuit. it's fun, fl...   love
...
6   I aded this in my basket at hte last mintue to...   admiration
7   I ordered this in carbon for store pick up, an...   neutral
8   I love this dress. i usually get an xs but it ...   love
9   I'm 5"5' and 125 lbs. i ordered the s petite t...   love
...
16  Material and color is nice. the leg opening i...    neutral
17  Took a chance on this blouse and so glad i did...   admiration
...
26  I have been waiting for this sweater coat to s...   excitement
27  The colors weren't what i expected either. the...   disapproval
...
31  I never would have given these pants a second ...   love
32  These pants are even better in person. the onl...   disapproval
33  I ordered this 3 months ago, and it finally ca...   disappointment
34  This is such a neat dress. the color is great ...   admiration
35  Wouldn't have given them a second look but tri...   love
36  This is a comfortable skirt that can span seas...   approval
...
40  Pretty and unique. great with jeans or i have ...   admiration
41  This is a beautiful top. it's unique and not s...   admiration
42  This poncho is so cute i love the plaid check ...   love
43  First, this is thermal ,so naturally i didn't ...   love

Prayson W. Daniel
  • Prayson, thanks a lot, it works! I have a query though: is there any way to get the whole dataset in the output with this additional emotion column? – Shrumo Oct 27 '21 at 08:28
  • Yes, the code above does that if you comment out `.head()`. The `.assign` function adds the new column. – Prayson W. Daniel Oct 27 '21 at 08:56
  • Okay, got it. What does this part of the code do? `.get("label", None)` – Shrumo Oct 27 '21 at 10:15
  • The result is in the form `[{'label': 'emotion_name', 'score': float_value}]`. We take the first element, which is a dictionary, and then `.get("label", None)`: get the label, defaulting to None if there is no label. – Prayson W. Daniel Oct 27 '21 at 10:23
  • There is one issue that I noticed with this model. The default number of tokens seems to be set at 512. Is there any way to change it in this code? I tried this on a dataset that has longer texts, and I am getting this error: InvalidArgumentError: indices[0,512] = 514 is not in [0, 514) [Op:ResourceGather] – Shrumo Oct 27 '21 at 17:35
  • We can solve that: `emotion_func` can take `x` in chunks of 512 tokens if `x` has more tokens than that, and select the emotion with the highest score (see the sketch after these comments). – Prayson W. Daniel Oct 27 '21 at 19:01
  • I tried the code with `emotion_func` but I still get that error, because my data has one sentence that crosses the 512-token limit. What can I do about this? Will I have to delete that row and re-run the code, or is there some way I can retain those texts and still get an emotion output? – Shrumo Oct 28 '21 at 07:09
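
A minimal sketch of the chunking idea described in the comments above, reusing the `emotion` pipeline and `tokenizer` from the answer; the helper name `emotion_chunked` and the 500-token window are illustrative assumptions, not part of the original answer.

```python
# Hedged sketch: split long texts into windows that fit under the 512-token
# limit, score each window, and keep the emotion with the highest score.
# `emotion_chunked` and the 500-token window are illustrative assumptions;
# `emotion` and `tokenizer` come from the answer above.
def emotion_chunked(text: str, max_tokens: int = 500) -> str:
    if not text:
        return None
    # split into windows of at most max_tokens word-piece tokens
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = [tokenizer.decode(token_ids[i:i + max_tokens])
              for i in range(0, len(token_ids), max_tokens)]
    # each pipeline call returns [{'label': ..., 'score': ...}]
    results = [emotion(chunk)[0] for chunk in chunks]
    # keep the emotion with the highest score across all chunks
    best = max(results, key=lambda r: r.get("score", 0.0))
    return best.get("label", None)

dataf["Emotion"] = dataf["Review Text"].fillna("").map(emotion_chunked)
```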

You can use pickle.

Pickle lets you, well, pickle your model for later use. In fact, you can use a loop to keep training the model until it reaches a certain accuracy, then exit the loop and pickle the model for later use.

You can find many tutorials on YouTube on how to pickle a model.
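
A minimal sketch of what this could look like with a simple scikit-learn classifier; the toy data, the TF-IDF + logistic regression model, and the 0.80 accuracy target are illustrative assumptions, not part of this answer.

```python
# Hedged sketch: train a simple classifier until it reaches a target accuracy,
# then pickle it for later use. Data, model choice, and threshold are
# illustrative assumptions.
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# tiny made-up training data (1 = positive, 0 = negative)
texts = ["great product", "terrible service", "love this dress",
         "awful experience", "really comfortable", "poor quality fabric"]
labels = [1, 0, 1, 0, 1, 0]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())

# keep training until the target accuracy is reached (capped so it terminates);
# in practice you would add data or tune the model between attempts and
# evaluate on held-out data rather than the training set
target = 0.80
for _ in range(10):
    model.fit(texts, labels)
    if model.score(texts, labels) >= target:
        break

# pickle the fitted model for later use ...
with open("sentiment_model.pkl", "wb") as f:
    pickle.dump(model, f)

# ... and load it back in a later session
with open("sentiment_model.pkl", "rb") as f:
    loaded = pickle.load(f)
print(loaded.predict(["this top is wonderful"]))
```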

wraient