
I am building a tool from scratch that takes a sample of text and turns it into a list of categories. I am not using any libraries for this at the moment, but I'd be interested to hear from anyone with experience in this area. The hardest part, and the one I'm struggling with, is building sentiment into the search: word matching is easy, but sentiment is much more challenging.

The goal would be to take something like this paragraph:

"Whenever I am out walking with my son, I like to take portrait photographs of him to see how he changes over time. My favourite is a pic of him when we were on holiday in Spain and when his face was covered in chocolate from a cake we had baked"

and turn it into

categories = ['father', 'photography', 'travel', 'spain', 'cooking', 'chocolate']

If possible, I'd like to end up adding a filter for negative sentiment, so that if the text said:

"I hate cooking"

'cooking' wouldn't be included in the categories.

Any help is greatly appreciated. TIA

bubbaspaarx

1 Answer


You seem to have at least two tasks: (1) sequence classification by topic and (2) sentiment analysis. [Edit: I only just noticed that you are using Ruby/Rails, whereas the code below is in Python. The answer may still be useful to others, and the steps can be applied in any language.]

1. For sequence classification by topic, you can either define the categories with a simple list of words, as you said, or use a pre-trained model. Depending on the use case, the word list might be the easiest option. If creating that list is too time-consuming, you can use a pre-trained zero-shot classifier. I would recommend the zero-shot classifier from HuggingFace; see details with code here.

Applied to your use-case, this would look like this:

# pip install transformers  # run this in a terminal
from transformers import pipeline

# Loads a default pre-trained model (downloaded on first use)
classifier = pipeline("zero-shot-classification")

# Pass a single string so that a single result dict is returned
sequence = "Whenever I am out walking with my son, I like to take portrait photographs of him to see how he changes over time. My favourite is a pic of him when we were on holiday in Spain and when his face was covered in chocolate from a cake we had baked"
candidate_labels = ['father', 'photography', 'travel', 'spain', 'cooking', 'chocolate']

# multi_label=True scores each label independently
# (this argument was named multi_class in older transformers releases)
classifier(sequence, candidate_labels, multi_label=True)

# output: 
{'labels': ['photography', 'spain', 'chocolate', 'travel', 'father', 'cooking'],
 'scores': [0.9802802205085754, 0.7929317951202393, 0.7469273805618286, 0.6030028462409973, 0.08006269484758377, 0.005216470453888178]}

The classifier returns a score for each candidate label, reflecting how certain it is that the label is represented in your sequence. It doesn't catch everything, but it works quite well and is fast to put into practice.
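
To turn those scores into the final list of categories, one option is to keep only the labels above a cutoff. This is a minimal sketch, not part of the transformers API: the helper name select_labels is hypothetical, and the 0.5 threshold is an arbitrary assumption that you would need to tune on your own data.

```python
def select_labels(result, threshold=0.5):
    """Keep the labels whose score is at least `threshold`.

    `result` is a dict shaped like the zero-shot output above, with
    parallel 'labels' and 'scores' lists. The 0.5 default is an
    assumption for illustration, not a recommended value.
    """
    return [
        label
        for label, score in zip(result["labels"], result["scores"])
        if score >= threshold
    ]


# Scores copied (rounded) from the output above.
result = {
    "labels": ["photography", "spain", "chocolate", "travel", "father", "cooking"],
    "scores": [0.980, 0.793, 0.747, 0.603, 0.080, 0.005],
}
categories = select_labels(result)
print(categories)  # ['photography', 'spain', 'chocolate', 'travel']
```

With this cutoff, 'father' and 'cooking' are dropped for the example paragraph, which illustrates the point that the classifier doesn't catch everything.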

2. For sentiment analysis you can use HuggingFace's sentiment classification pipeline. In your use-case, this would look like this:

classifier = pipeline("sentiment-analysis")
sequence = ["I hate cooking"]
classifier(sequence)

# Output
[{'label': 'NEGATIVE', 'score': 0.9984041452407837}]

Putting 1. and 2. together: I would probably (a) first split your entire text into sentences (see here how to do that); then (b) run the sentiment classifier on each sentence and discard those with a high negative sentiment score (see step 2 above); and then (c) run your labeling/topic classification on the remaining sentences (see step 1 above).
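
Steps (a) to (c) could be wired together roughly as follows. This is only a sketch under several assumptions: the function names split_sentences and categorize are hypothetical, the regex splitter is deliberately naive (it will mis-split abbreviations such as "e.g."), and both thresholds are placeholder values. The two classifiers are passed in as plain functions so that the logic does not depend on any particular library.

```python
import re


def split_sentences(text):
    """(a) Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def categorize(text, sentiment_fn, topic_fn,
               neg_threshold=0.9, topic_threshold=0.5):
    """Return topic labels found in the non-negative sentences of `text`.

    sentiment_fn(sentence) -> {'label': ..., 'score': ...}
    topic_fn(sentence)     -> {'labels': [...], 'scores': [...]}
    Both thresholds are assumed values that need tuning.
    """
    categories = []
    for sentence in split_sentences(text):
        sentiment = sentiment_fn(sentence)
        # (b) Discard sentences that are confidently negative.
        if sentiment["label"] == "NEGATIVE" and sentiment["score"] >= neg_threshold:
            continue
        # (c) Label the remaining sentences, avoiding duplicate categories.
        topics = topic_fn(sentence)
        for label, score in zip(topics["labels"], topics["scores"]):
            if score >= topic_threshold and label not in categories:
                categories.append(label)
    return categories
```

With the HuggingFace pipelines from steps 1 and 2, sentiment_fn could be something like `lambda s: sentiment_classifier(s)[0]` and topic_fn something like `lambda s: topic_classifier(s, candidate_labels, multi_label=True)`, where sentiment_classifier and topic_classifier are the two pipeline objects (the variable names are placeholders).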

Moritz
    Thank you so so much for this. I can't emphasise how valuable the information above is to me right now. This is all very new to me and I've got one heck of a steep climb to get a base understanding of it. Really appreciated – bubbaspaarx Nov 02 '20 at 12:38