I am working on a project in which I have to extract nouns, adjectives, noun phrases, and verbs from text files in .doc format. I have a corpus of around 75 such files. Searching online, I came across POS tagging in Python using NLTK. Since my project is in C# (using Visual Studio 2008), I need C# code that does this. I have tried the WordNet API and even SharpNLP, but as a newbie I found them tough to integrate with my project. Can anybody suggest simpler code for this, for example something vocabulary-based? Please help. Thanks.
-
This is a whole research area of its own. I'm not sure there are any easy-to-use libraries for C#. An example of an online resource (which I think is open source) can be found at http://barbar.cs.lth.se:8081/ with further information at http://code.google.com/p/mate-tools/. It is not written in C#, but it may provide some ideas. – Mikael Nov 12 '10 at 12:46
-
I used SharpNLP, which you mention, in a school project a couple of years ago. I remember that it worked pretty well, so I definitely recommend you check it out. If you are new to natural-language parsing, you'll need to dedicate some time to understanding parse trees etc. http://www.codeproject.com/KB/recipes/englishparsing.aspx – Ozzy Nov 12 '10 at 13:14
-
@Ozzy I don't think the OP is worried about parses. Tagging typically comes before parsing (and after tokenization), so if all he needs are the POS tags, he'll never have to worry about parse trees. – Chris Pfohl Nov 12 '10 at 13:18
2 Answers
I worked in NLP (Natural Language Processing) for an industry leader for a while, and what you want to do is no trivial task. I know one of the creators of nltk and I have used it myself; it's a high-quality open source tool and I'd recommend you use it (do you have a particularly compelling reason to use C#?).
POS tagging is typically implemented by training a model of language on hand-annotated data, then applying that model to new text, predicting the part of speech of each word and giving a confidence for each prediction. nltk has tools that do this, and it also comes with some models (if I'm not mistaken).
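To make the train-then-apply idea concrete, here is a minimal sketch in plain Python (not C#, and not how nltk actually implements its taggers). It is a toy most-frequent-tag baseline: it counts how often each word carries each tag in a tiny annotated corpus, then tags new words with their most frequent tag and reports the relative frequency as a crude confidence. The corpus, the tag names, and the function names `train` and `tag` are all made up for illustration; real taggers use far more data and take context into account.

```python
from collections import Counter, defaultdict

# Tiny hand-annotated corpus, invented for this example:
# each sentence is a list of (word, tag) pairs.
TRAINING_DATA = [
    [("the", "DET"), ("quick", "ADJ"), ("dog", "NOUN"), ("runs", "VERB")],
    [("a", "DET"), ("lazy", "ADJ"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("she", "PRON"), ("runs", "VERB")],
    [("morning", "NOUN"), ("runs", "NOUN"), ("help", "VERB")],
]


def train(sentences):
    """Count how often each (lowercased) word appears with each tag."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, pos in sentence:
            counts[word.lower()][pos] += 1
    return counts


def tag(model, words, default_tag="NOUN"):
    """Return (word, tag, confidence) triples for a list of words.

    Confidence is the share of training occurrences that carried the
    chosen tag. Unknown words get default_tag with confidence 0.0.
    """
    result = []
    for word in words:
        tag_counts = model.get(word.lower())
        if not tag_counts:
            result.append((word, default_tag, 0.0))
            continue
        best_tag, best_count = tag_counts.most_common(1)[0]
        confidence = best_count / sum(tag_counts.values())
        result.append((word, best_tag, confidence))
    return result


model = train(TRAINING_DATA)
for word, pos, conf in tag(model, ["The", "lazy", "dog", "runs"]):
    print(f"{word}\t{pos}\t{conf:.2f}")
```

In this toy corpus "runs" is tagged as a verb three times and as a noun once, so it comes out as VERB with a confidence of 0.75. The same counting logic would port to C# with a dictionary of dictionaries, but the hard part in practice is getting enough annotated data, which is why a pre-trained model is usually the better route.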
You'll find that most tools are written in C++, Java, and Python. If you don't know any of these languages, look at this as an excellent opportunity to learn something!
See Wikipedia, especially the links at the bottom, for more information and other software available to use for such tagging.

Christopher is correct in his statement that NLP implementations are no picnic. However, I've recently looked into a viable solution using OpenNLP in a .NET project with a rudimentary PoS parser. In my example I am looking for noun phrases, but it shouldn't be too difficult a task to find other fragments as well. I find the OpenNLP Tools Models for 1.5 to be sufficient for my purposes.
I realize this answer is woefully late for the questioner, but hopefully it will give others some inspiration in a field that is difficult to get into.
Extracting noun phrases with contextual relevance in .NET using OpenNLP
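For readers who only want a feel for the noun-phrase step, here is a rough sketch in plain Python of pulling noun phrases out of text that has already been POS-tagged. It is not the OpenNLP API and not the code from the linked article. It assumes a simplified pattern (an optional determiner, any number of adjectives, then one or more nouns), and the tag names and the function name `noun_phrases` are hypothetical. A real chunker handles many more constructions.

```python
def noun_phrases(tagged):
    """Collect runs matching DET? ADJ* NOUN+ from (word, tag) pairs."""
    phrases = []
    i, n = 0, len(tagged)
    while i < n:
        start = i
        # Optional single determiner.
        if tagged[i][1] == "DET":
            i += 1
        # Any number of adjectives.
        while i < n and tagged[i][1] == "ADJ":
            i += 1
        # One or more nouns are required to form a phrase.
        noun_start = i
        while i < n and tagged[i][1] == "NOUN":
            i += 1
        if i > noun_start:
            phrases.append(" ".join(word for word, _ in tagged[start:i]))
        elif i == start:
            # Nothing matched at this position; skip one token.
            i += 1
        # Otherwise a DET/ADJ run had no noun after it; resume scanning at i.
    return phrases


# Hand-tagged sample sentence, invented for this example.
sample = [
    ("the", "DET"), ("quick", "ADJ"), ("brown", "ADJ"), ("fox", "NOUN"),
    ("jumps", "VERB"), ("over", "ADP"),
    ("the", "DET"), ("lazy", "ADJ"), ("dog", "NOUN"),
]
print(noun_phrases(sample))
```

For the sample above this yields "the quick brown fox" and "the lazy dog". Verbs and adjectives on their own can be extracted from the same tagged list simply by filtering on the tag.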
