I have a list of 10 million domains and want to programmatically separate the English words in each domain, for example:

getheadphones.com results in "get headphones"

I know that when I put getheadphones into Google I get "get headphones", but I'm not sure how they do that, or how they know that it is not "get head phones".

Any ideas? Preferably in PHP.

vzwick
iwek
  • Hey, where'd you get the list from? – Caffeinated Oct 26 '11 at 01:23
  • I suspect google uses [n-gram](http://en.wikipedia.org/wiki/N-gram) among other algorithms to find the largest words out of the glommed value. As for headphones vs head phones, I'd assume word frequency but beyond the assumption, I'm way out of my league. – billinkc Oct 26 '11 at 01:32
  • They don't *know* it's not "get head phones", they *assume* it's "get headphones". – Dave Newton Oct 26 '11 at 01:39
  • billinkc, thank you for the n-gram link – iwek Oct 26 '11 at 01:44
  • possible duplicate of [How can I split multiple joined words?](http://stackoverflow.com/questions/195010/how-can-i-split-multiple-joined-words) – NotMe Oct 26 '11 at 02:04

1 Answer

Google is famous for its spell checker, and it does far more than that to figure out what you mean to search for. However, this problem has already been dealt with in this question.

To get a list of English words, OS X and some Linux boxes ship one at /usr/share/dict/words; otherwise you can get one from (sourceforge).
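As a rough illustration of how a word list could be put to work, here is a minimal sketch in Python (not PHP, but the logic should port directly). It assumes a simple "fewest words wins" rule, which is only one possible heuristic and not necessarily what Google does. The names `load_words`, `segment`, and `sample_words` are made up for this example.

```python
from pathlib import Path


def load_words(path="/usr/share/dict/words"):
    """Read one word per line into a lowercase set.

    The default path is an assumption; it exists on OS X and some Linux boxes.
    """
    lines = Path(path).read_text().splitlines()
    return {line.strip().lower() for line in lines if line.strip()}


def segment(text, words):
    """Split text into dictionary words, preferring the fewest words.

    Returns a list of words, or None if the text cannot be fully split.
    """
    text = text.lower()
    n = len(text)
    max_len = max((len(w) for w in words), default=0)
    # best[i] holds the shortest word list covering text[:i], or None.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        # Only look back as far as the longest dictionary word.
        for start in range(max(0, end - max_len), end):
            if best[start] is None:
                continue
            piece = text[start:end]
            if piece in words:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


# Tiny hand-made word set so the demo does not depend on a system file.
sample_words = {"get", "head", "phones", "headphones"}
name = "getheadphones.com".split(".")[0]  # drop the TLD
print(" ".join(segment(name, sample_words)))  # get headphones
```

With a real dictionary you would pass `load_words()` instead of `sample_words`. System word lists tend to include single letters and obscure short entries, so the fewest-words rule can still pick odd splits; weighting candidates by word frequency, as suggested in the comments, is likely to be more robust.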

Community
uncreative