
I want to do topic modelling, but in my case one article may contain many topics: I have an article (a Word file) that contains several topics, and each topic is associated with a company (see the example below).

I have a text as input :

"IBM is an international company specializing in all that is IT, on the other hand Facebook is a social network and Google is a search engine. IBM invented a very powerful computer."

We already have labeled topics: "Products and services", "Communications", "Products and services"...

I want to have as output:

IBM : Products and services
Facebook : Communications
Google : Products and services

So I think we can do this by splitting the text, i.e. associating each company with the parts of the text that talk about it, for example:

IBM : ['IBM is an international company specializing in all that is IT', 'IBM invented a very powerful computer.']
Facebook : ['Facebook is a social network']
Google : ['Google is a search engine']

Then, for each company, perform topic modelling on the parts of the text associated with it. Output:

IBM : Products and services
Facebook : Communications
Google : Products and services

Could you help me split the text and match its parts to each company, i.e. how to determine the parts that talk about Facebook in

Sergey Bushmanov

1 Answer


It seems like you have two separate problems: (1) Data preparation/cleaning, i.e. splitting your text into the right units for analysis; (2) classifying the different units of text into "topics".

1. Data Preparation

An 'easy' way of doing this would be to split your text into sentences and use sentences as your unit of analysis. spaCy is good for this, for example (see e.g. this answer here). Your example is more difficult, since you want to split sentences even further, so you would have to come up with custom logic for splitting your text according to specific patterns, e.g. using regular expressions. I don't think there is a standard way of doing this, and it depends very much on your data.
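As a rough illustration of such custom logic, here is a minimal sketch that splits the example text into clauses with a regular expression and groups the clauses by the company they mention. The `COMPANIES` list and the `split_by_company` helper are hypothetical names made up for this example, and the splitting pattern (sentence punctuation, commas and the word "and") is an assumption tuned to your sample sentence; real data will most likely need different rules.

```python
import re
from collections import defaultdict

# Hypothetical list of company names; in practice it could come from a
# named-entity recognizer or a hand-maintained lookup table.
COMPANIES = ["IBM", "Facebook", "Google"]

def split_by_company(text, companies=COMPANIES):
    """Group the clauses of `text` under every company they mention by name."""
    # Assumption: clauses end at sentence punctuation, at commas, or at " and ".
    parts = re.split(r"[.!?]+\s*|,\s*|\s+and\s+", text)
    segments = defaultdict(list)
    for part in parts:
        part = part.strip()
        if not part:
            continue  # skip empty strings left over from the split
        for company in companies:
            # Whole-word match, so "IBM" does not match inside a longer token.
            if re.search(rf"\b{re.escape(company)}\b", part):
                segments[company].append(part)
    return dict(segments)

text = (
    "IBM is an international company specializing in all that is IT, "
    "on the other hand Facebook is a social network and Google is a search engine. "
    "IBM invented a very powerful computer."
)
segments = split_by_company(text)
```

Clauses that mention no company are dropped, and a clause that mentions two companies is assigned to both. Splitting on "and" is crude (it would also cut a phrase like "products and services" in two), so treat the pattern as a starting point only.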

2. Topic classification

If I understand correctly, you already have the labels ("topics" like ["Products and services", "Communications"]) which you want to assign to different texts. In this case, topic modeling is probably not the right tool, because topic modeling is mostly used when you want to discover new topics and don't know the topics/labels yet. In any case, a topic model would only return the most frequent/exclusive words associated with a topic, not a neat abstract topic label like "Products and services". You also need enough text for a topic model to produce meaningful output.

A more elegant solution is zero-shot classification. This basically means that you take a general machine learning model that has been pre-trained by someone else in a very general way for text classification and you simply apply it to your specific use case for "topic classification" without having to train/fine-tune it. The Transformers library has a very easy to use implementation of this.

# pip install transformers==3.1.0  # pip install in terminal
from transformers import pipeline

classifier = pipeline("zero-shot-classification")

sequence1 = "IBM is an international company specializing in all that is IT"
sequence2 = "Facebook is a social network."
sequence3 = "Google is a search engine. "
candidate_labels = ["Products and services", "Communications"]

classifier(sequence1, candidate_labels)
# output: {'labels': ['Products and services', 'Communications'], 'scores': [0.8678683042526245, 0.1321316659450531]}
classifier(sequence2, candidate_labels)
# output: {'labels': ['Communications', 'Products and services'], 'scores': [0.525628387928009, 0.47437164187431335]}
classifier(sequence3, candidate_labels)
# output: {'labels': ['Products and services', 'Communications'], 'scores': [0.5514479279518127, 0.44855210185050964]}

=> It classifies all texts correctly based on your example and labels. The label ("topic") with the highest score is the one the model thinks fits your text best. Note that you have to think hard about which labels are the most suitable. In your example, I wouldn't even be sure as a human which one fits better, and the model is not very sure either (see the scores close to 0.5 for Facebook and Google). With this zero-shot classification approach you can choose the topic labels that you find most adequate.
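To get from per-text scores to the `company : topic` output you asked for, one option is to join each company's text parts, classify the joined text once, and keep the highest-scoring label. The sketch below assumes a dict that maps each company to its list of text parts; `label_companies` is a hypothetical helper name, and `classify` can be any callable that returns a dict with `'labels'` and `'scores'`, like the `classifier` above.

```python
def label_companies(segments, candidate_labels, classify):
    """Return {company: best label} for a {company: [text parts]} mapping.

    `classify(text, candidate_labels)` is assumed to return a dict with
    parallel 'labels' and 'scores' lists, as the zero-shot pipeline does.
    """
    result = {}
    for company, parts in segments.items():
        # Classify all text about one company as a single sequence.
        output = classify(". ".join(parts), candidate_labels)
        # Pick the label with the highest score explicitly rather than
        # relying on the order of the returned lists.
        best_label, _ = max(
            zip(output["labels"], output["scores"]),
            key=lambda pair: pair[1],
        )
        result[company] = best_label
    return result

# Usage with the pipeline from above (downloads a model on first run):
# classifier = pipeline("zero-shot-classification")
# label_companies({"Facebook": ["Facebook is a social network"]},
#                 ["Products and services", "Communications"], classifier)
```

Joining the parts gives the model more context per company, which may help when the individual clauses are as short as in your example. Classifying each part separately and averaging the scores would be an alternative.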

Here is an interactive web application to see what it does without coding. Here is a Jupyter notebook which demonstrates how to use it in Python. You can just copy-paste code from the notebook.

Moritz