I have an interesting problem. I have a list of billions of URLs. Something like:
www.fortune.com
www.newyorktimes.com
www.asdf.com
I also have an English dictionary as a JSON file (https://github.com/dwyl/english-words). How can I count the number of English words detected in each URL?
For example, for the URLs above, the counts should be 1, 3, 0 (fortune; new, york, times; and no words for asdf). The ideal output is a Pandas DataFrame with the URLs and the count of English words in each URL.
The problem is challenging because there is no delimiter between words in a URL, so a naive approach amounts to a brute-force search over all possible splits.
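One possible approach, sketched below under a few assumptions: load the dwyl dictionary into a set, strip the hostname down to its middle label(s), and run a word-break dynamic program that finds the *fewest* dictionary words covering the string (so "fortune" counts as 1 rather than "for" + "tune" = 2), returning 0 when no full segmentation exists. The file path, the `host_core` helper, and the small stand-in word set are illustrative, not from the original question.

```python
import json
import pandas as pd

# In practice you would load the dwyl dictionary (a JSON object mapping
# each word to 1) into a set for O(1) membership tests:
#   with open("words_dictionary.json") as f:
#       words = set(json.load(f))
# A tiny stand-in set is used here so the sketch runs on its own.
words = {"fortune", "for", "tune", "new", "york", "times"}

def count_words(s, words, max_len=24):
    """Fewest dictionary words that exactly cover s (word-break DP).
    Returns 0 if no full segmentation exists."""
    n = len(s)
    INF = float("inf")
    dp = [INF] * (n + 1)          # dp[i] = fewest words covering s[:i]
    dp[0] = 0
    for i in range(1, n + 1):
        # only look back max_len chars: dictionary words are short
        for j in range(max(0, i - max_len), i):
            if dp[j] != INF and s[j:i] in words:
                dp[i] = min(dp[i], dp[j] + 1)
    return dp[n] if dp[n] != INF else 0

def host_core(url):
    """Strip scheme, 'www.' and the TLD; a simplification --
    real-world URLs would need more careful parsing."""
    host = url.split("//")[-1].split("/")[0].lower()
    parts = host.split(".")
    if parts and parts[0] == "www":
        parts = parts[1:]
    return "".join(parts[:-1]) if len(parts) > 1 else parts[0]

urls = ["www.fortune.com", "www.newyorktimes.com", "www.asdf.com"]
df = pd.DataFrame({"url": urls})
df["word_count"] = [count_words(host_core(u), words) for u in df["url"]]
# df["word_count"] -> [1, 3, 0]
```

The DP is O(len(s) * max_len) per URL, so it stays cheap even over billions of rows; at that scale you would likely also want to chunk the input and deduplicate hostnames before counting.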