
I have an interesting problem. I have a list of billions of URLs. Something like:

www.fortune.com
www.newyorktimes.com
www.asdf.com

I also have an English dictionary as a JSON file: https://github.com/dwyl/english-words. How can I count the number of English words detected in each URL?

For example, for the URLs above, the counts should be 1, 3, and 0 (the detected words being fortune; new, york, times). The ideal output is a Pandas DataFrame with the URLs and the count of English words in each URL.

The problem is challenging because there isn't a delimiter between words in the URL, so it ends up being something of a brute-force search.
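For concreteness, here is a minimal sketch of loading that dictionary and the output shape I'm after (assuming the file is the words_dictionary.json from that repo); the counting itself is the part I'm missing:

    import json
    import pandas as pd

    # words_dictionary.json from https://github.com/dwyl/english-words
    # maps every word to 1, so the keys are the vocabulary.
    with open("words_dictionary.json") as f:
        vocab = set(json.load(f))

    urls = ["www.fortune.com", "www.newyorktimes.com", "www.asdf.com"]

    # Desired output: one row per URL plus the count of English words found in it.
    df = pd.DataFrame({"url": urls, "english_word_count": [1, 3, 0]})
    print(df)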

user2205916
  • What about compound words? Are they treated as one or two? I.e. is `rainbow.com` one or two words? – Primusa Sep 09 '19 at 00:00
  • I think my dictionary will have rain, bow, and rainbow, so it will count as 3 words, which is fine. Alternatively, if a match is found and length(match) = length(url), then the search can proceed to the next URL. – user2205916 Sep 09 '19 at 01:12
  • Note that "for", "or", "tune", "fort", "time", "me", "as", etc. are also all english words. Getting the counts is not terribly difficult (although somewhat computationally intensive), but you might want to revisit whether that's really what you want. – Nick Bastin Sep 09 '19 at 01:18
  • For a first iteration, that will be fine. Worst case, I can manually edit my dictionary to remove useless words. – user2205916 Sep 09 '19 at 01:20
  • possible duplicate of the below post: https://stackoverflow.com/questions/195010/how-can-i-split-multiple-joined-words – eugen Sep 09 '19 at 05:00
  • Maybe something like https://github.com/alvations/mini-segmenter ? – alvas Sep 09 '19 at 10:02

1 Answer


This might not be the best way, but the most fun way would be to train a seq2seq model. Take sections of real text and make the training pairs (the text with spaces removed, the original text with spaces). Make sure to include organization and product names in the training examples. I think this could get pretty good accuracy, but that's just intuition.
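For the training-data side only, here is a minimal sketch of how those pairs could be built (the sentences and the make_pair helper are made up for illustration; the model itself isn't shown):

    # Strip the spaces out of real text to get the input; keep the original as the target.
    def make_pair(sentence):
        return sentence.replace(" ", ""), sentence

    corpus = [
        "fortune favors the bold",
        "new york times breaking news",
        "acme rocket powered products",  # include org/product-style names too
    ]
    pairs = [make_pair(s) for s in corpus]
    # [('fortunefavorsthebold', 'fortune favors the bold'), ...]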

However, if you're more of a traditional data-structures-and-algorithms type, you could build a trie from your vocabulary. As you read the characters between "www." and ".com", you walk down the trie; whenever you reach a node marked as the end of a word, you've found a dictionary word, so you (conceptually) insert a space and continue with the remaining characters.
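Here is a rough sketch of that trie idea in Python, assuming the words_dictionary.json file from the question's repo. Note that with the full word list the counts will come out larger than 1, 3, 0, because short words like "for", "tune", and "as" also match, as the comments point out:

    import json
    import pandas as pd

    # Vocabulary from https://github.com/dwyl/english-words (the keys are the words).
    with open("words_dictionary.json") as f:
        vocab = json.load(f)

    # Build the trie; the "$" key marks "a word ends here".
    trie = {}
    for word in vocab:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True

    def host_core(url):
        """Keep only the part between 'www.' and the final '.tld'."""
        host = url.lower()
        if host.startswith("www."):
            host = host[4:]
        return host.rsplit(".", 1)[0]

    def count_words(text):
        """Count every dictionary word starting at any position, so 'rainbow'
        yields rain, bow, and rainbow, matching the behaviour discussed above."""
        count = 0
        for start in range(len(text)):
            node = trie
            for ch in text[start:]:
                if ch not in node:
                    break
                node = node[ch]
                if "$" in node:   # a word ends at this character
                    count += 1
        return count

    urls = ["www.fortune.com", "www.newyorktimes.com", "www.asdf.com"]
    df = pd.DataFrame({"url": urls})
    df["english_word_count"] = [count_words(host_core(u)) for u in urls]
    print(df)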

Sam H.