I am doing some NLP with Python on YouTube comments I have downloaded, and I only want to process English ones. So far I have experimented with different libraries (many of the ones discussed in this thread), and they work fine for longer strings, but many of them run into problems with the shorter, one- or two-word comments. My question is whether it would be hopelessly inefficient to download a dictionary of English words and check each of these short, problematic comments against it, obviously discarding the ones that don't match.
I can foresee problems with things such as misspellings, or words that appear in both English and a foreign language, but at present I am more concerned about speed, as I have about 68 million comments to process.
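For what it's worth, here is a minimal sketch of the kind of lookup I have in mind. Membership tests against a Python `set` are O(1) on average, so even tens of millions of lookups should be fast once the word list is in memory. The inline word list, the `looks_english` helper, and the 0.5 threshold are all placeholders; a real run would load a proper word list from a file (for example `/usr/share/dict/words` on many Unix systems):

```python
# Tiny inline word list as a stand-in; in practice this would be loaded
# from a dictionary file into a set for O(1) average-case lookups.
ENGLISH_WORDS = {"hello", "great", "video", "thanks", "nice"}

def looks_english(comment: str, threshold: float = 0.5) -> bool:
    """Treat a short comment as English if enough of its tokens
    appear in the word set (threshold is a tunable guess)."""
    # Strip basic punctuation and lowercase each whitespace-split token.
    tokens = [t.strip(".,!?'\"").lower() for t in comment.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return False
    hits = sum(1 for t in tokens if t in ENGLISH_WORDS)
    return hits / len(tokens) >= threshold

print(looks_english("Nice video!"))  # True: both tokens are in the set
print(looks_english("muy bueno"))    # False: no tokens match
```

The expensive part is loading and normalising the word list, which only happens once; after that, each comment costs roughly one hash lookup per token.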