
I have a 2-million-long list of podcast names. I also have a huge text corpus scraped from a subreddit (posts, comments, threads, etc.) where the podcasts from our list are mentioned a lot by users. The task I'm trying to solve is to count the number of mentions of each name in the corpus. In other words, generate a dictionary of (name: count) pairs.

The challenge here is that most of these podcast names are several words long, e.g. "Utah's Noon News" or "Congress Hears Tech Policy Debates". However, the mentions which Reddit users make are often a crude substring of the original name, e.g. "Utah Noon" / "Utah New" or "Congress Tech Debates" / "Congress Hears Tech". This makes identifying names from the list quite difficult.

What I've tried: First, I processed and concatenated all the words of each original podcast name into a single word. For instance, "Congress Hears Tech Policy Debates" -> "congresshearstechpolicydebates"

As I traversed the subreddit corpus, whenever I found a named entity or a potential podcast name, I processed its words the same way:
"Congress Hears Tech" (assuming this is what I found in the corpus) -> "congresshearstech"

I compared this "congresshearstech" string to all the processed names in the podcast list. I made this comparison using scores calculated from word-spelling similarity, via the difflib Python library (there are also similarity measures like Levenshtein and Hamming distance). Eventually, I picked the podcast name with the maximum similarity score to the corpus-found string.
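Roughly, the comparison step looks like this (a simplified sketch; `podcast_names` stands in for my actual 2-million-name list):

```python
import difflib

podcast_names = ["Utah's Noon News", "Congress Hears Tech Policy Debates"]  # actually ~2 million names

# Preprocess: "Congress Hears Tech Policy Debates" -> "congresshearstechpolicydebates"
processed = {"".join(name.lower().split()): name for name in podcast_names}

def best_match(mention):
    """Compare a corpus-found string against every processed name and return the closest one."""
    key = "".join(mention.lower().split())  # "Congress Hears Tech" -> "congresshearstech"
    best_name, best_score = None, 0.0
    for proc, original in processed.items():
        score = difflib.SequenceMatcher(None, key, proc).ratio()
        if score > best_score:
            best_name, best_score = original, score
    return best_name, best_score
```

So every candidate mention found in the corpus triggers a full pass over all 2 million processed names.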

My problem: The above strategy is in fact working accurately. However, it's way too slow to run over the entire corpus, and my list of names is very long. Can anyone please suggest a faster algorithm/data structure to compare so many names against such a huge corpus? Is there any deep-learning-based approach possible here? Something like training an LSTM on the 2 million podcast names, so that whenever a possible name is encountered, the trained model can output the closest spelling of any podcast from our list?

    I’m voting to close this question because it is not about programming as defined in the [help] but about ML theory and/or methodology - please see the intro and NOTE in https://stackoverflow.com/tags/machine-learning/info – desertnaut Oct 31 '21 at 22:54

2 Answers


You may be able to use something like tf-idf and cosine similarity to solve this problem. I'm not aware of a machine learning approach that would be helpful here.

This article gives a more detailed description of the process and links to some useful libraries. You should also read this article which describes a somewhat similar project to yours and includes information on improving performance. I'll describe the method as I understand it here.

tf-idf is an acronym for "term frequency-inverse document frequency". Essentially, you look at a subset of text and find the frequency of its terms relative to the frequency of those terms in the entire corpus. Terms that are common both in your subset and in the corpus as a whole get a low value, whereas terms that are common in your subset but rare in the corpus get a high value.

If you can compute the tf-idf for a "document" (or subset of text), you can turn that subset into a vector of tf-idf values. Once you have this vector, you can use it to compute the cosine similarity of your text subset with other subsets - say, the similarity of an excerpt from Reddit with each of your titles. (There is a way to manage this so you aren't continuously checking each Reddit excerpt against literally every title - see this post.)
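As a rough sketch of that vectorization step (using scikit-learn with character n-grams, so that crude partial mentions still share features with the full titles; `titles` is a stand-in for your 2-million-name list):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

titles = ["Utah's Noon News", "Congress Hears Tech Policy Debates"]  # the full podcast-name list in practice

# Character n-grams let a partial mention overlap with the full title it came from.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4))
title_vectors = vectorizer.fit_transform(titles)   # sparse matrix, one row per title

def closest_title(text):
    """Return the title most similar to the given text, with its cosine similarity."""
    query = vectorizer.transform([text])
    scores = cosine_similarity(query, title_vectors)[0]
    best = scores.argmax()
    return titles[best], scores[best]

print(closest_title("Congress Tech Debates"))
```

The argmax over a sparse matrix product is already much cheaper than a Python loop that rescans every title per mention.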

Once you can do this, I think the solution is to pick some value n and scan through the Reddit posts n words at a time, doing the tf-idf / cosine similarity comparison against your titles and marking a match whenever the similarity is higher than a certain threshold (you'll need to experiment to find what gives you a good result). Then decrement n and repeat until n is 0.
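Continuing the sketch above, that scanning loop might look something like this (the window size and threshold are placeholders you would tune):

```python
def scan_post(text, max_n=6, threshold=0.6):
    """Slide an n-word window over a post and record titles whose similarity clears the threshold."""
    words = text.split()
    matches = []
    for n in range(max_n, 0, -1):
        for i in range(len(words) - n + 1):
            window = " ".join(words[i:i + n])
            title, score = closest_title(window)   # from the snippet above
            if score >= threshold:
                matches.append((window, title, score))
    return matches
```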

inteoryx

If exact text matching (with or without your whitespace-removal preprocessing) is sufficient, consider the Aho-Corasick string matching algorithm for detecting substring matches (i.e. the podcast names) in a body of text (i.e. the subreddit content). There are many implementations of this algorithm for Python, but ahocorapy has a good readme that summarizes how to use it on a dataset.
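If you go that route, usage looks roughly like this (based on my reading of the ahocorapy readme; `podcast_names` and `subreddit_text` are placeholders for your data):

```python
from ahocorapy.keywordtree import KeywordTree

podcast_names = ["Utah's Noon News", "Congress Hears Tech Policy Debates"]  # the full 2M-name list
subreddit_text = "did anyone catch congress hears tech policy debates yesterday?"  # the scraped corpus

# Lowercase everything up front so matching is effectively case-insensitive.
kwtree = KeywordTree()
for name in podcast_names:
    kwtree.add(name.lower())
kwtree.finalize()  # build the automaton once; each search is then linear in the text length

counts = {name.lower(): 0 for name in podcast_names}
for keyword, _start in kwtree.search_all(subreddit_text.lower()):
    counts[keyword] += 1
```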

If fuzzy matching is a requirement (i.e. also matching when the mention text is not an exact match of the podcast name), then consider a fuzzy string matching library like thefuzz (aka fuzzywuzzy), if per query-document operations offer sufficient performance. Another approach is to precompute n-grams from the substrings and accumulate the support counts across all n-grams for each document, as the fuzzyset package does.
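For example, with fuzzyset the n-gram index over all names is built once, and each mention is then a single lookup rather than a scan over the whole list (a sketch; the variable names are placeholders):

```python
import fuzzyset

podcast_names = ["Utah's Noon News", "Congress Hears Tech Policy Debates"]  # the full list

fs = fuzzyset.FuzzySet()   # maintains an n-gram index internally
for name in podcast_names:
    fs.add(name)

# get() returns a list of (score, matched_name) pairs, or None if nothing is close enough.
result = fs.get("Congress Tech Debates")
if result:
    score, matched_name = result[0]
    print(matched_name, score)
```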

If additional information about the podcasts is available in a knowledge base (i.e. more than just the name is known), then the problem is more like the general NLP task of entity linking but to a custom knowledge base (i.e. the podcast list). This is an area of active research and state of the art methods are discussed on NLP Progress here.

grov
  • Thank you so much for your input, @grov. Yes! "Fuzzy matching" is the idea I wanted to convey but I wasn't aware of this term. As I said, the difflib library is pretty accurate at identifying the closest word. However, my concern is time efficiency. Each time I have to iterate over 2 million names to identify a single occurrence in the subreddit. Could there be a data structure that makes this word-access faster? Or any faster method? – Shivam Arya Jha Oct 30 '21 at 20:02
  • Hi @Shivram, I edited the answer to include a reference to the fuzzyset package. I also found [this stackoverflow discussion](https://stackoverflow.com/questions/17740833/checking-fuzzy-approximate-substring-existing-in-a-longer-string-in-python) to be helpful. – grov Oct 31 '21 at 03:10