
I am writing a web crawler in Python that downloads a list of URLs, extracts all visible text from the HTML, tokenizes the text (using `nltk.tokenize`), and then builds a positional inverted index of the words in each document for use by a search feature.
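For context, here is a simplified sketch of the indexing step (illustrative only; the function and variable names are my own):

```python
from collections import defaultdict

from nltk.tokenize import word_tokenize  # requires nltk.download('punkt')

def build_positional_index(docs):
    """docs maps a document ID to the visible text extracted from its HTML."""
    # token -> doc_id -> list of positions at which the token occurs
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, token in enumerate(word_tokenize(text.lower())):
            index[token][doc_id].append(position)
    return index
```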

However, right now, the index contains a bunch of useless entries like:

1) //roarmag.org/2015/08/water-conflict-turkey-middle-east/

2) ———-

3) ykgnwym+ccybj9z1cgzqovrzu9cni0yf7yycim6ttmjqroz3wwuxiseulphetnu2

4) iazl+xcmwzc3da==

Some of these, like #1, are URLs that appear in the text. Others, like #3, are excerpts from PGP keys or other random data embedded in the text.

I am trying to understand how to filter out useless data like this. However, I don't want to keep only words I would find in an English dictionary; I also want to keep things like names, places, nonsense words like "Jabberwocky" or "Rumpelstiltskin", acronyms like "TANSTAAFL", obscure technical/scientific terms, etc.

That is, I'm looking for a way to heuristically strip out strings that are "gibberish": (1) exceedingly long, (2) filled with punctuation, or (3) composed of random characters like afhdkhfadhkjasdhfkldashfkjahsdkfhdsakfhsadhfasdhfadskhkf. I understand that there is no way to do this with 100% accuracy, but if I could remove even 75% of the junk I'd be happy.

Are there any techniques that I can use to separate "words" from junk data like this?

J. Taylor
  • The simplest way to do this is to have a dictionary at hand and compare a given string against it using string distance metrics. `word2vec` might also be an idea, and you could create a few regexes that match words. – zython Apr 02 '18 at 13:55
  • @zython: "string distance metrics" and "simple" do not co-exist. Computing distance metrics for every word turns out to be an incredibly expensive proposition, and it gives totally wrong results in a lot of cases. For example, "thqt" has a distance of only 1 from "that", but it's definitely not a word in any language. Both regexes and distance metrics are good for determining form, but they are terrible when it comes to determining meaning. Consider, for example, that the regex to match a 32-bit integer is very complex. – Jim Mischel Apr 02 '18 at 17:01
  • @JimMischel I don't think the OP is looking for 100% accuracy, and misspellings of words happen; I also don't think the OP is looking for meaning so much as for somewhat-correct words. I also don't understand your last point: any 32-bit integer will match 32 0s or 1s when converted to binary. Nonetheless, I can see that your point of view makes sense. – zython Apr 02 '18 at 17:36

1 Answer


Excessively long words are trivial to filter. It's pretty easy to filter out URLs, too. I don't know about Python, but other languages have libraries you can use to determine if something is a relative or absolute URL. Or you could just use your "strings with punctuation" filter to filter out anything that contains a slash.
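In Python, a rough version of those filters might look something like this (a sketch only; the thresholds are arbitrary guesses you would want to tune, and `urlparse` is the standard-library URL parser):

```python
from urllib.parse import urlparse

MAX_LEN = 30         # arbitrary cutoff for "exceedingly long" tokens
MAX_NON_ALNUM = 0.2  # arbitrary cutoff for punctuation-heavy tokens

def looks_like_url(token):
    # Catches absolute URLs like http://example.org/page. Protocol-relative
    # forms like //roarmag.org/... still contain a slash, so the slash test
    # below handles those.
    parsed = urlparse(token)
    return bool(parsed.scheme and parsed.netloc)

def is_junk(token):
    if len(token) > MAX_LEN:
        return True                          # excessively long
    if looks_like_url(token) or '/' in token:
        return True                          # URL-ish or path-like
    non_alnum = sum(not c.isalnum() for c in token)
    return non_alnum / max(len(token), 1) > MAX_NON_ALNUM  # punctuation-heavy
```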

Words are trickier, but you can do a good job with n-gram language models. Basically, you build or obtain a language model and run each string through it to determine the likelihood of that string being a word in the particular language. For example, "Rumpelstiltskin" will have a much higher likelihood of being an English word than, say, "xqjzipdg".
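To make the idea concrete, here is a toy character-bigram model (my own sketch, not any particular library's API; it trains on NLTK's `words` corpus, and the smoothing constant is arbitrary). Tokens scoring below some empirically chosen cutoff would be treated as junk:

```python
import math
from collections import Counter

from nltk.corpus import words  # requires nltk.download('words')

def train_char_bigrams(vocabulary):
    # Count character bigrams over known-good words, with ^/$ boundary markers.
    pair_counts, char_counts = Counter(), Counter()
    for w in vocabulary:
        w = "^" + w.lower() + "$"
        for a, b in zip(w, w[1:]):
            pair_counts[a, b] += 1
            char_counts[a] += 1
    return pair_counts, char_counts

def wordlikeness(token, pair_counts, char_counts, alpha=0.01):
    # Average smoothed log P(next char | current char); higher = more word-like.
    token = "^" + token.lower() + "$"
    logp = 0.0
    for a, b in zip(token, token[1:]):
        logp += math.log((pair_counts[a, b] + alpha) /
                         (char_counts[a] + alpha * 100))
    return logp / (len(token) - 1)

pairs, chars = train_char_bigrams(words.words())
print(wordlikeness("rumpelstiltskin", pairs, chars))  # noticeably higher...
print(wordlikeness("xqjzipdg", pairs, chars))         # ...than this score
```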

See https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark for a trained model that might be useful to you in determining if a string is an actual word in some language.

See also NLTK and language detection.

Jim Mischel
  • Thanks Jim. I think n-gram models might be just what I'm looking for. I'm going to try to implement something using `langdetect` or other n-gram based tools, and will post my solution later if it works out. – J. Taylor Apr 03 '18 at 18:50
  • I'm not really sure if it provided an answer to my question (although it was definitely helpful). But since the post is old and downvoted, and unlikely to be seen by anyone else, I doubt that any more information will be added ... so why not? :) – J. Taylor May 04 '18 at 05:31