
I'm pretty out of my depth here — hoping this is alright to post. I have a list of 1000 or so headlines. I'm trying to identify headlines that are about the same thing but worded differently.

Hoping to be pointed in the direction of the least difficult way to do this, find out if there are any existing tools out there for this, find relevant tutorials, etc. I've been Googling but haven't found anything on this specifically, possibly because I'm missing the vocab to describe it. (In an ideal world, there'd be some online tool for this that I wouldn't have to code, but I'll try to code it if necessary.) Thanks.

codi6
  • Does this answer your question? [how to compare two strings by meaning?](https://stackoverflow.com/questions/59413960/how-to-compare-two-strings-by-meaning). Once you can compare them by meaning, finding similar ones is easy (iterate through the list of headlines and find the ones that are within some similarity threshold of each other) – cocomac May 23 '22 at 03:05
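A minimal sketch of the thresholding loop described in the comment above, assuming you already have some function that scores two headlines. The names `jaccard`, `similar_pairs`, and the `0.5` threshold are made up for illustration. `jaccard` only measures word overlap, so it is a placeholder and not a meaning-based comparison; you would swap in whatever semantic similarity function you end up using.

```python
from itertools import combinations


def jaccard(a, b):
    """Placeholder score: word-set overlap (0.0 to 1.0), NOT meaning-based."""
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def similar_pairs(headlines, similarity=jaccard, threshold=0.5):
    """Return (score, headline_a, headline_b) for every pair at or above threshold."""
    pairs = []
    # Compare every headline with every other one exactly once.
    for a, b in combinations(headlines, 2):
        score = similarity(a, b)
        if score >= threshold:
            pairs.append((score, a, b))
    # Most similar pairs first.
    pairs.sort(key=lambda item: item[0], reverse=True)
    return pairs


if __name__ == "__main__":
    sample = [
        "Storm hits the coast",
        "Storm hits coast hard",
        "Mayor opens new bridge",
    ]
    for score, a, b in similar_pairs(sample):
        print(f"{score:.2f}  {a!r}  ~  {b!r}")
```

For about 1000 headlines this is roughly 500,000 comparisons, which should be manageable with a cheap scoring function; the threshold will need tuning by eye.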

1 Answer


One way you could solve this, at least to a rough approximation:

  1. Count the total number of occurrences of each word in the entire list.
  2. Group together words that share the same root. E.g. walks, walking, walked. Add together those word counts.
  3. Sort this frequency list from most common to least common.
  4. Take the headlines that contain word-group 1 (the most common one) at least once, and sort them by how many times it occurs in each.
  5. Repeat (4) for word-group 2 in the frequency list, and so on through the end of the frequency list.
  6. You would now have a short list of related headlines from each word-group. Browse some of these yourself to see if there are some meaningfully similar ones.
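The steps above could be sketched roughly like this in Python, using only the standard library. Everything named here (`crude_root`, `roots_in`, `group_headlines`, `STOP_WORDS`, `min_group_size`) is hypothetical. The suffix stripping is a very crude stand-in for a real stemmer, and the stop-word list is an extra assumption that is not part of the steps, added so that words like "the" do not dominate the frequency list.

```python
import re
from collections import Counter

# Assumption: a tiny stop-word list, not part of the steps above.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "for",
              "and", "is", "as", "at", "by", "with"}


def crude_root(word):
    """Rough stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def roots_in(headline):
    """Lowercase a headline, split it into words, and map each word to its root."""
    words = re.findall(r"[a-z]+", headline.lower())
    return [crude_root(w) for w in words if len(w) > 1 and w not in STOP_WORDS]


def group_headlines(headlines, min_group_size=2):
    """Return [(root, [headlines...])], most frequent root first."""
    # Steps 1-2: count word roots per headline and across the whole list.
    per_headline = [Counter(roots_in(h)) for h in headlines]
    totals = Counter()
    for counts in per_headline:
        totals.update(counts)

    groups = []
    # Step 3: walk the frequency list from most to least common.
    for root, _ in totals.most_common():
        # Steps 4-5: headlines containing this root, most occurrences first.
        members = [(counts[root], h)
                   for counts, h in zip(per_headline, headlines)
                   if counts[root] > 0]
        if len(members) >= min_group_size:
            members.sort(key=lambda pair: -pair[0])
            groups.append((root, [h for _, h in members]))
    return groups


if __name__ == "__main__":
    sample = [
        "Fed raises interest rates again",
        "Interest rates raised by the Fed",
        "Local team wins championship",
        "Team walks away with the win",
    ]
    # Step 6: browse each group by eye.
    for root, members in group_headlines(sample):
        print(root, "->", members)
```

Since this only matches shared words, it will miss headlines that describe the same event with entirely different vocabulary, so treat the groups as candidates to look through rather than a final answer.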
brobers