I'm working on a project that takes headlines from newspapers' websites, which I have stored in two text files (nyt.text and wapo.text), and compares them against each other. If two headlines are judged similar by Python's built-in SequenceMatcher, it prints them along with their similarity rating:

from difflib import SequenceMatcher


def similar(a, b):
    # Similarity ratio between 0 and 1, computed character by character
    return SequenceMatcher(None, a, b).ratio()


def compare():
    # Read the headlines from each file, one headline per line
    with open('wapo.text') as w:
        wapo = w.readlines()
    with open('nyt.text') as f:
        times = f.readlines()
    print(wapo[0], times[0])
    # Score every Washington Post headline against every NYT headline
    for i in wapo:
        for s in times:
            score = similar(i, s)
            print(score)
            if score > 0.35:
                print(i, s)


compare()

The result that I'm getting looks something like this:

    Attorney for San Bernardino gunman's family floats hoax theory
 Op-Ed Contributor: A Battle in San Bernardino

San Bernardino attacker pledged allegiance to Islamic State leader, officials say
 Sunday Routine: How Jamie Hodari, Workplace Entrepreneur, Spends His Sundays

Why some police departments let anyone listen to their scanner conversations - even criminals
 White House Seeks Path to Executive Action on Gun Sales

Why the Pentagon opening all combat roles to women could subject them to a military draft
 Scientists Seek Moratorium on Edits to Human Genome That Could Be Inherited

Destroying the Death Star was a huge mistake
 Mark Zuckerberg Defends Structure of His Philanthropic Outfit

As you can see, apart from the first pair they're not very similar, despite being rated above 0.35 by SequenceMatcher. I have an inkling that this is because SequenceMatcher judges similarity by letter, not by word. Would anyone have ideas on how to tokenize the headlines so that SequenceMatcher compares whole words instead of individual letters?

MattDMo
n1c9

1 Answer


Your intuition here is likely spot on. You're seeing matches based on uninterrupted runs of matching letters, which is generally a pretty poor metric for headline similarity.

It's doing this because the sequence you're passing in is a string, or as the computer sees it, a really long list of letters.

If you want to judge on words instead, I would suggest splitting the text with the .split() method, which with no arguments just splits on whitespace, and passing the resulting lists of words to SequenceMatcher.
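As a rough sketch of that idea (the helper name similar_words is made up for illustration, and the two headlines are taken from the output in the question), SequenceMatcher accepts any sequences of hashable items, so lists of words work as well as strings:

```python
from difflib import SequenceMatcher


def similar_words(a, b):
    # Hypothetical helper: compare lists of words instead of strings of letters
    return SequenceMatcher(None, a.split(), b.split()).ratio()


wapo = "Attorney for San Bernardino gunman's family floats hoax theory"
nyt = "Op-Ed Contributor: A Battle in San Bernardino"

# Character-level score, as in the question
char_score = SequenceMatcher(None, wapo, nyt).ratio()
# Word-level score: only "San" and "Bernardino" match
word_score = similar_words(wapo, nyt)

print(char_score, word_score)
```

With word lists, two headlines that share no words score 0.0, so the 0.35 threshold from the question would probably need retuning.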

There's a lot of cleaning you can and probably should do, such as removing punctuation, setting everything to lowercase ('.lower()'), as well as potentially stemming the words to get reasonable matches. That said, all of those pieces are well documented elsewhere and might not make sense for your particular use case.
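A minimal sketch of that kind of cleaning, using only the standard library (the names tokenize and similar_tokens are hypothetical, and stemming is left out because it would need a third-party library):

```python
import string
from difflib import SequenceMatcher

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def tokenize(headline):
    # Lowercase, strip punctuation, then split on whitespace
    return headline.lower().translate(_PUNCT_TABLE).split()


def similar_tokens(a, b):
    # Word-level similarity on the cleaned tokens
    return SequenceMatcher(None, tokenize(a), tokenize(b)).ratio()


print(tokenize("Op-Ed Contributor: A Battle in San Bernardino"))
```

Note that deleting punctuation outright joins hyphenated words ("Op-Ed" becomes "oped") and drops apostrophes ("gunman's" becomes "gunmans"), which may or may not be what you want for headlines.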

You can also look at other tokenizers in sklearn, but they're unlikely to make a huge difference here.

Community
Slater Victoroff