
Right now, I have a dataset of around 70,000 tweets and I'm trying to remove tweets that are overly similar. I have decided to use a Levenshtein ratio greater than 0.9 as the threshold for tweets that are too similar. However, my current code compares every tweet to every other tweet, giving me O(N^2), which is painfully slow: it takes over 12 hours to run, which is just too much. I tried parallelizing it, which is what I'm running right now, but I don't know whether that will speed it up to an acceptable degree. Is there some way I can accomplish this in O(N)?

import json
import multiprocessing as mp
from pathlib import Path

import Levenshtein

def check_ratio(similar_removed, try_tweet):
    # Return 1 if try_tweet is within a 0.9 Levenshtein ratio of any tweet kept so far.
    for tweet in similar_removed:
        if Levenshtein.ratio(try_tweet, tweet[1]) > 0.9:
            return 1
    return 0

def main():
    path = Path('C:/Data/Python/JobLoss')
    # Collect the tweet text (third field of each record).
    orig_data = []
    with open(path / 'Processed.json') as f:
        data = json.load(f)
        for tweet in data:
            orig_data.append(tweet[2])
    pool = mp.Pool(mp.cpu_count())
    # Keep (index, text) pairs for tweets that are not too similar to anything kept earlier.
    similar_removed = []
    similar_removed.append((0, orig_data[0]))
    for i in range(1, len(orig_data)):
        print('%s / %s' % (i, len(orig_data)))
        try_tweet = orig_data[i]
        # pool.apply blocks until the single task returns, so the comparisons still run one at a time.
        too_similar = pool.apply(check_ratio, args=(similar_removed, try_tweet))
        if not too_similar:
            similar_removed.append((i, try_tweet))
    pool.close()
    final = [data[ind[0]] for ind in similar_removed]
    with open(path / 'ProcessedSimilarRemoved.json', 'w') as f:
        json.dump(final, f)

if __name__ == '__main__':
    main()
StormFalcon
    Have you considered training a word2vec model on your data, and then filtering out tweets which lie next to each other in the embedded space instead? – user2589273 Nov 13 '20 at 14:39
    For two strings that are *overly similar*, are you discarding both of them? – wwii Nov 13 '20 at 14:50
  • [Clustering a long list of strings (words) into similarity groups](https://stats.stackexchange.com/questions/123060/clustering-a-long-list-of-strings-words-into-similarity-groups) - (stats.stackexchange) - lots of books and papers out there. – wwii Nov 13 '20 at 15:12
  • @user2589273 - any references for your suggestion? – wwii Nov 13 '20 at 15:12
  • @wwii just one of them – StormFalcon Nov 13 '20 at 15:13
  • @user2589273 wouldn't that still require you to check every pair of tweets? – StormFalcon Nov 13 '20 at 15:15
  • I'm considering using some sort of hashing function that gives similar strings similar hashes. That should give me close to O(N) – StormFalcon Nov 13 '20 at 15:15
  • [https://towardsdatascience.com/calculating-string-similarity-in-python-276e18a7d33a](https://towardsdatascience.com/calculating-string-similarity-in-python-276e18a7d33a) – wwii Nov 13 '20 at 15:35
    @wwii I've seen that article. I don't see anything in there that would help me get around the issue I'm currently having: I need to calculate the similarity of every pair of tweets, which is O(N^2) and too slow. – StormFalcon Nov 13 '20 at 16:35
    @StormFalcon not if you use a pre-trained model; then it would just be assigning each tweet a vector based on what the model has been trained on. It's just an idea which may work amazingly or not at all. – user2589273 Nov 13 '20 at 17:23
    @user2589273 right, but you still need to check that vector against every other vector to see if it's too close to any of them? – StormFalcon Nov 13 '20 at 18:46
  • @StormFalcon good point. Is it the distance computation or the filtering that takes up the most time in your code? – user2589273 Nov 13 '20 at 18:53
  • @user2589273 the filtering unfortunately – StormFalcon Nov 13 '20 at 18:54
  • @user2589273 well it's essentially 70,000 * 70,000 = something really big – StormFalcon Nov 13 '20 at 18:58
    That shouldn't take that long for 70,000 values! Are you sure a lot of that isn't due to your conversion into a string and printing to the screen? I would also pre-allocate an array and edit the values (True/False) as opposed to extending it each time. @StormFalcon – user2589273 Nov 13 '20 at 18:59
    You could also consider taking a single value and scanning chunks of the entire dataset (in parallel). These chunks can then be reduced in size before the next value scans them. – user2589273 Nov 13 '20 at 19:00
  • @user2589273 yeah I might try doing it in chunks – StormFalcon Nov 13 '20 at 19:01

1 Answer


I ended up using a method described in the top answer to this question, which uses locality-sensitive hashing (LSH) for sublinear-time queries, giving an overall sub-quadratic complexity that was fast enough for my needs. Essentially, you turn each tweet into a set of k-shingles, then use MinHash LSH to quickly bucket similar sets together. Then, instead of comparing each tweet to every other tweet, you only need to look in its bucket for matches, which makes this considerably faster than computing the Levenshtein ratio on every pair.
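For reference, here is a minimal sketch of that pipeline, assuming the datasketch library (MinHash / MinHashLSH) and character-level shingles; note that the LSH threshold applies to the estimated Jaccard similarity of the shingle sets rather than to the Levenshtein ratio, so it may need tuning against a 0.9 cutoff.

from datasketch import MinHash, MinHashLSH

def shingles(text, k=5):
    # Character-level k-shingles of a tweet; short tweets fall back to a single shingle.
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def deduplicate(tweets, threshold=0.9, num_perm=128, k=5):
    # Index tweets with MinHash LSH and keep only those with no near-duplicate already indexed.
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept_indices = []
    for i, tweet in enumerate(tweets):
        m = MinHash(num_perm=num_perm)
        for sh in shingles(tweet, k):
            m.update(sh.encode('utf-8'))
        # Query only the candidate bucket instead of every previously kept tweet.
        if not lsh.query(m):
            lsh.insert(str(i), m)
            kept_indices.append(i)
    return kept_indices

Each tweet is queried against the index before being inserted, so the per-tweet cost depends on the size of its candidate bucket rather than on the number of tweets kept so far.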

StormFalcon