I'm working on a project that takes headlines from newspaper websites, which I have stored in two text files (nyt.text and wapo.text), and compares them against each other. If two headlines are judged similar by difflib.SequenceMatcher from Python's standard library, the script prints them along with their similarity rating:
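from difflib import SequenceMatcher

# Read-only access is enough; the files are never written to.
f = open('nyt.text', 'r')
w = open('wapo.text', 'r')

def similar(a, b):
    # Ratio between 0 and 1, computed over the characters of both strings.
    return SequenceMatcher(None, a, b).ratio()

def compare():
    wapo = []
    times = []
    for line in w.readlines():
        wapo.append(line)
    for i in f.readlines():
        times.append(i)
    print(wapo[0], times[0])
    # Compare every Washington Post headline against every NYT headline.
    for i in wapo:
        for s in times:
            score = similar(i, s)
            print(score)
            if score > 0.35:
                print(i, s)

compare()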
The result that I'm getting looks something like this:
Attorney for San Bernardino gunman's family floats hoax theory
Op-Ed Contributor: A Battle in San Bernardino
San Bernardino attacker pledged allegiance to Islamic State leader, officials say
Sunday Routine: How Jamie Hodari, Workplace Entrepreneur, Spends His Sundays
Why some police departments let anyone listen to their scanner conversations - even criminals
White House Seeks Path to Executive Action on Gun Sales
Why the Pentagon opening all combat roles to women could subject them to a military draft
Scientists Seek Moratorium on Edits to Human Genome That Could Be Inherited
Destroying the Death Star was a huge mistake
Mark Zuckerberg Defends Structure of His Philanthropic Outfit
As you can see, apart from the first pair they're not very similar, despite being rated above 0.35 by SequenceMatcher. I suspect this is because SequenceMatcher compares the strings character by character rather than word by word. Does anyone have ideas on how to tokenize the headlines so that SequenceMatcher compares whole words instead of individual letters?
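To make the question concrete, here is a rough sketch of the kind of word-level comparison I have in mind. It assumes that splitting on whitespace and lowercasing is good enough as tokenization, and that SequenceMatcher can be handed lists of words (it accepts any sequences of hashable items). The helper names char_similarity and word_similarity are made up for illustration and are not part of difflib.

```python
from difflib import SequenceMatcher

def char_similarity(a, b):
    # What my current code does: compare the two strings character by character.
    return SequenceMatcher(None, a, b).ratio()

def word_similarity(a, b):
    # Hypothetical helper: compare lists of lowercased, whitespace-separated
    # words, so each word is treated as a single element of the sequence.
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()

if __name__ == "__main__":
    a = "Destroying the Death Star was a huge mistake"
    b = "Mark Zuckerberg Defends Structure of His Philanthropic Outfit"
    print("by character:", char_similarity(a, b))
    print("by word:     ", word_similarity(a, b))
```

With this approach two headlines only score above zero if they share at least one whole word, so the threshold of 0.35 would probably need retuning. Punctuation still sticks to the words ("Contributor:" is not the same token as "Contributor"), so a better tokenizer might be needed.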