I have a list of about 2 million podcast names. I also have a huge text corpus scraped from a subreddit (posts, comments, threads, etc.) in which users frequently mention podcasts from that list. The task I'm trying to solve is to count how many times each name is mentioned in the corpus. In other words, I want to generate a dictionary of (name: count) pairs.
The challenge is that most of these podcast names are several words long, e.g. "Utah's Noon News" or "Congress Hears Tech Policy Debates". However, Reddit users often mention them using only a rough, partial version of the original name, e.g. "Utah Noon" / "Utah New" or "Congress Tech Debates" / "Congress Hears Tech". This makes identifying names from the list quite difficult.
What I've Tried: First, I processed each original podcast name by lowercasing it and concatenating all of its words into a single string. For instance, "Congress Hears Tech Policy Debates" -> "congresshearstechpolicydebates"
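This preprocessing step can be sketched roughly as follows. The function name `normalize` is hypothetical, and dropping punctuation such as the apostrophe in "Utah's" is an assumption made for illustration; the exact cleaning rules may differ.

```python
import re

def normalize(name: str) -> str:
    """Lowercase the name and drop everything except letters and digits.

    Removing punctuation (e.g. apostrophes) is an assumption of this sketch.
    """
    return re.sub(r"[^a-z0-9]", "", name.lower())

# The two example names from the list, reduced to single-string keys.
keys = {name: normalize(name)
        for name in ["Utah's Noon News", "Congress Hears Tech Policy Debates"]}
print(keys)
```

The same function is meant to be applied to both the podcast list and the candidate strings found in the corpus, so that both sides are compared in the same form.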
As I traversed the subreddit corpus, whenever I found a named entity or a potential podcast name, I processed its words in the same way:
"Congress Hears Tech" (assuming this is what I found in the corpus) -> "congresshearstech"
I then compared this "congresshearstech" string against all the processed names in the podcast list, using scores based on spelling similarity. I computed these scores with Python's difflib library; other similarity measures, such as Levenshtein and Hamming distance, could be used as well. Finally, I credited the mention to the podcast name whose similarity score with the corpus string was the highest.
My problem: The above strategy is in fact working accurately. However, it is far too slow to run on the entire corpus, because every candidate string has to be compared against my whole list of names, and that list is very long. Can anyone suggest a faster algorithm or data structure for comparing so many names against such a huge corpus? Is a deep-learning-based approach possible here? For example, could I train an LSTM on the 2 million podcast names, so that whenever a possible name is encountered, the trained model outputs the closest-matching podcast name from the list?