I am writing a web crawler in python that downloads a list of URLS, extracts all visible text from the HTML, tokenizes the text (using nltk.tokenize) and then creates a positional inverted index of words in each document for use by a search feature.
However, right now, the index contains a bunch of useless entries like:
1) //roarmag.org/2015/08/water-conflict-turkey-middle-east/
2) ———-
3) ykgnwym+ccybj9z1cgzqovrzu9cni0yf7yycim6ttmjqroz3wwuxiseulphetnu2
4) iazl+xcmwzc3da==
Some of these, like #1, are where URLs appear in the text. Some, like #3, are excerpts from PGP keys, or other random data that is embedded in the text.
I am trying to understand how to filter out useless data like this. But I don't just want to keep words that I would find in an English dictionary, but also things like names, places, nonsense words like "Jabberwocky" or "Rumpelstiltskin", acronyms like "TANSTAAFL", obscure technical/scientific terms, etc ...
That is, I'm looking for a way to heuristically strip out strings that are "jibberish". (1) exceedingly "long" (2) filled with a bunch of punctuation (3) composed of random strings of characters like afhdkhfadhkjasdhfkldashfkjahsdkfhdsakfhsadhfasdhfadskhkf ... I understand that there is no way to do this with 100% accuracy, but if I could remove even 75% of the junk I'd be happy.
Are there any techniques that I can use to separate "words" from junk data like this?