I know that in NLP it is a challenge to determine the topic of a sentence or a paragraph. However, I am trying to determine what the title might be for something like a Wikipedia article (of course, without using other methods). My only thought is to find the most frequent words. For the article on New York City, these were the top results:

[('new', 429), ('city', 380), ('york', 361), ("'s", 177), ('manhattan', 90), ('world', 84), ('united', 78), ('states', 74), ('===', 70), ('island', 68), ('largest', 66), ('park', 64), ('also', 56), ('area', 52), ('american', 49)]

From this I can see that there may be some statistical significance in the sharp drop from 361 to 177. However, I am neither a statistics nor an NLP expert (in fact, I'm a complete noob at both), so is this a viable way of determining the topic of a longer body of text? If so, what math am I looking for to calculate this? If not, is there some other way in NLP to determine the topic or title of a larger body of text? For reference, I am using nltk and Python 3.

Dylan Siegler

2 Answers

You might consider using the algorithms below. These are keyword extraction algorithms:

TF-IDF

TextRank

Here is a tutorial to get you started on using TF-IDF in nltk.
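To give a rough idea of what TF-IDF does, here is a minimal sketch in plain Python (standard library only, not the nltk API). The helper names `tokenize` and `tfidf_keywords` and the three toy sentences are made up for illustration, and the weighting used is the basic `tf * log(N / df)` variant; real libraries usually add smoothing and normalization.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep only alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(docs, doc_index, top_n=4):
    """Rank the terms of docs[doc_index] by a basic TF-IDF score."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Term frequency within the target document.
    tf = Counter(tokenized[doc_index])
    total = sum(tf.values())

    # A term that occurs in every document gets log(1) = 0.
    scores = {
        term: (count / total) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Toy corpus (hypothetical); a real one would be many full articles.
docs = [
    "New York City is the largest city in the United States.",
    "Paris is the capital and largest city of France.",
    "Tokyo is the capital of Japan and the largest city in Japan.",
]

keywords = tfidf_keywords(docs, 0)
print(keywords)
```

Note that "city" is the most frequent word in the first sentence, yet it scores zero because it occurs in every document. That is the main difference from raw frequency counting: words that are common everywhere are pushed down, and words that are distinctive for one document are pushed up.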

kun

If you have enough data and would like topics for a larger body of text, such as a paragraph or an article, you can use topic modelling methods like LDA.

Gensim has an easy-to-use implementation of LDA.

Ramtin M. Seraj
  • Can you please provide a link to a tutorial or elaborate more yourself? – Dylan Siegler Jul 26 '16 at 21:02
  • This is a step-by-step tutorial by [gensim](https://radimrehurek.com/gensim/wiki.html). If you are more interested in how LDA works internally, you can check [this](https://www.cs.princeton.edu/~blei/kdd-tutorial.pdf). – Ramtin M. Seraj Jul 27 '16 at 00:43
  • And what if I don't have enough data? If I want to extract the topic based on one sentence, what should I do? – mina Nov 14 '19 at 20:59
  • Then your best bet is to parse the sentence and extract all noun phrases. You can use spaCy (https://spacy.io/) and get `noun_chunks` from the sentence. See here: https://stackoverflow.com/questions/33289820/noun-phrases-with-spacy – Ramtin M. Seraj Nov 14 '19 at 21:26
  • I am using agglomerative clustering to cluster news headlines. Once I get the clusters, I want to find the topic of a particular cluster. Each cluster has only 10-15 sentences on roughly the same topic. Please suggest a way to find the topics in such a scenario. – Ibtsam Ch Jul 05 '22 at 07:48