4

If I've got a block of text, in English, what's the best method of clearing away all the "filler" words like "the, it, or, we, us", etc... leaving only viable words to be considered the real, core, content of the text?

I'm brainstorming a way to automatically tie blocks of text together based on how similar they are in keyword composition.

I can't be the first one to imagine this. Is there a popular, effective way this can be accomplished using C#?

Update

I am essentially trying to link one block of text to n "related" blocks of text, where the primary "content" is so similar that each related block could be considered additional information for the text it is linked to.

Chaddeus
  • Although I am not a native English speaker, I doubt that "the" is just a filler word. Consider: “No, not _the_ Zaphod Beeblebrox, _a_ Zaphod Beeblebrox. Didn't you hear I come in six-packs now?” – Vlad Jun 21 '12 at 10:51
  • @Vlad - I would say it is filler in the sense that it does not help at all to infer the topic of a sentence. – Rotem Jun 21 '12 at 10:52
  • partial list of words and phrases you'd want to eliminate: http://www.smart-words.org/transition-words.html – Rotem Jun 21 '12 at 10:53
  • @Rotem: well, without _the_, the whole point of the sentence would be completely unclear. – Vlad Jun 21 '12 at 10:54
  • @Vlad The question is if the OP is trying to infer the meaning of a sentence or the topic. Even without the the, I can infer the sentence deals with Zaphod BeebleBrox and/or six-packs. – Rotem Jun 21 '12 at 10:55
  • Phwew, re-reading my update, did I just make my thought more difficult to understand? :) @Rotem, yes, I am (in a sense) trying to infer the meaning of the text. I'm not trying to remove the filler words, which could have an impact on humans reading the text. – Chaddeus Jun 21 '12 at 12:45

3 Answers

5

These are called stop words: words that are usually (1) not essential for understanding the data, and that indexers therefore remove.

Almost every Information Retrieval system I am aware of implements a tokenizer that filters these words.

I am familiar with Java's Lucene, which has a StandardAnalyzer that does this for you. I assume this analyzer also exists in Lucene.Net, so you may want to track it down and use it.

You might also be interested in stemming, which Lucene also does, for instance in EnglishAnalyzer.


(1) Why usually? In sarcasm detectors, for example, it seems (empirically) that stop words are critical to getting good results.
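
To make the idea concrete without pulling in Lucene, here is a minimal C# sketch of stop-word filtering. It is not Lucene's implementation: the class name `StopWordFilter`, the method `ContentWords`, the tiny stop list, and the simple tokenizing regex are all assumptions made for illustration. Real stop lists contain a few hundred words.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class StopWordFilter
{
    // Tiny illustrative stop list (an assumption, not Lucene's default list).
    private static readonly HashSet<string> StopWords = new HashSet<string>(
        new[] { "the", "it", "or", "we", "us", "a", "an", "and", "of", "to", "in", "is" },
        StringComparer.OrdinalIgnoreCase);

    // Splits the text into lowercase word tokens and drops the stop words.
    public static List<string> ContentWords(string text)
    {
        return Regex.Matches(text.ToLowerInvariant(), @"[a-z0-9']+")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .Where(w => !StopWords.Contains(w))
                    .ToList();
    }
}
```

For example, `StopWordFilter.ContentWords("We gave it to the themes team")` yields `gave`, `themes`, `team`. Because whole tokens are compared against the set, a word such as `themes` is not affected by the stop word `the`.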

amit
  • I'm not entirely sure that the poster does want to filter out stop words (as opposed to some other hand crafted list of words), however the same techniques used to filter out stop words could certainly be used to filter out any other list of words. – Justin Jun 21 '12 at 11:02
  • @Justin: The Java interface of the analyzers in Lucene (and, I can only assume, the C# one as well) allows one to manually give a list of stop words to be used by the analyzer. – amit Jun 21 '12 at 11:04
  • @amit, thank you! I've got some reading to do now, but this looks like what I'm wanting. I simply want to relate blocks of text to each other, based on their real content (not the stop words). Thanks! – Chaddeus Jun 21 '12 at 12:48
  • @amit, also I'm using RavenDB in my project... which I believe uses Lucene.net heavily. Perhaps I could knock out a lot of this work on the DB end. Hmm... – Chaddeus Jun 21 '12 at 12:56
3

If you want to do this at a large scale, and if the list of filter words is going to grow constantly, then you can use an NLP toolkit such as OpenNLP.

You can use it to remove prepositions, connectors, etc.

gout
2

Create a list of 'filler words'. Replace every occurrence of any element of this list in the original block of text with String.Empty.

string replace using a List<string>
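
A plain substring replace would also hit filler words that sit inside longer words, so a rough sketch of a safer variant is shown here. It matches each filler word only as a whole word by using regex word boundaries. The class name `FillerWordRemover` and its `Remove` method are hypothetical names chosen for this example, and whether the matching should ignore case is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class FillerWordRemover
{
    public static string Remove(string text, IEnumerable<string> fillerWords)
    {
        // Escape each word, then match it only as a whole word (\b is a word boundary).
        string alternatives = string.Join("|", fillerWords.Select(w => Regex.Escape(w)));
        string pattern = @"\b(?:" + alternatives + @")\b";

        string stripped = Regex.Replace(text, pattern, string.Empty, RegexOptions.IgnoreCase);

        // Collapse the runs of whitespace left behind by the removed words.
        return Regex.Replace(stripped, @"\s+", " ").Trim();
    }
}
```

With the filler list `{ "the", "of" }`, the input `"The themes of the story"` becomes `"themes story"`: both occurrences of `the` are removed, while `themes` is left intact.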

Community
  • But what if that filler word is inside a good word? It will also get replaced. like words `the` and `themes`? – Nikhil Agrawal Jun 21 '12 at 10:53
  • Split them first? Answer updated to include link that elaborates on this. –  Jun 21 '12 at 10:55
  • This is what I imagined right away... but I wanted to ask just in case there was a better way. Thanks! – Chaddeus Jun 21 '12 at 12:50
  • @Chad - not a problem. I like the look of atmybase.com a lot mate, looking good. There's a bug on the search (when the textbox is empty) but I'm assuming you know that. Could be a winner tho, don't know of anything else like that. –  Jun 21 '12 at 14:14
  • Thanks @Daniel... the next version I'm working on now will integrate Foursquare, Instagram, Facebook, and SocialCam to help "describe" a location (and make it "social"). The text part I'm working on now is to make it so articles of information the military members share about their base relate together automatically without requiring them to associate it themselves (and relate better than "they're in the same category"). Thanks again! – Chaddeus Jun 21 '12 at 14:20