
I'm working on a project that consists of a website that connects to NCBI (National Center for Biotechnology Information) and searches for articles there. The thing is that I have to do some text mining on all the results. I'm using Java for the text mining and AJAX with ICEfaces for the development of the website.

What I have: a list of articles returned from a search, where each article has an ID and an abstract. The idea is to extract keywords from each abstract, then compare the keywords across all abstracts and find the ones that are repeated most often, so the website can show the related words for the search.

Any ideas? I searched a lot on the web, and I know there is Named Entity Recognition, Part-of-Speech tagging, and the GENIA thesaurus for NER on genes and proteins; I already tried stemming, stop-word lists, etc. I just need to know the best approach to solve this problem. Thanks a lot.

de.la.ru

4 Answers


I would recommend you use a combination of POS tagging and string tokenizing to extract all the nouns from each abstract. Then use some sort of dictionary/hash to count the frequency of each of these nouns, and output the N most prolific ones. Combining that with some other intelligent filtering mechanisms should do reasonably well at giving you the important keywords from each abstract.

For POS tagging, check out the Stanford POS tagger at http://nlp.stanford.edu/software/index.shtml
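A minimal sketch of the counting step in plain Java, assuming the tagger's output is already in hand as `word_TAG` tokens (the slash-free underscore format the Stanford tagger emits by default); the noun test relies on all Penn Treebank noun tags starting with `NN`:

```java
import java.util.*;
import java.util.stream.Collectors;

public class NounFrequency {
    // Count how often each noun occurs in tagged text of the form
    // "word_TAG word_TAG ...". Penn Treebank noun tags all start with
    // "NN" (NN, NNS, NNP, NNPS).
    public static Map<String, Integer> countNouns(String taggedText) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : taggedText.split("\\s+")) {
            int sep = token.lastIndexOf('_');
            if (sep < 0) continue;                       // not a word_TAG token
            String word = token.substring(0, sep).toLowerCase();
            String tag = token.substring(sep + 1);
            if (tag.startsWith("NN")) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Return the n most frequent nouns, most frequent first.
    public static List<String> topNouns(String taggedText, int n) {
        return countNouns(taggedText).entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

In the real pipeline you would feed each abstract through the tagger first and merge the per-abstract maps to find keywords shared across the whole result set.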

However, if you are expecting a lot of multi-word terms in your corpus, then instead of extracting just nouns you could take the most prolific n-grams for n = 2 to 4.
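The n-gram alternative needs no tagger at all; a sketch in plain Java, where "prolific" simply means highest raw frequency (stop-word filtering would be layered on top):

```java
import java.util.*;

public class NGramFrequency {
    // Count every n-gram (as a space-joined string) over a lowercased,
    // punctuation-stripped token stream.
    public static Map<String, Integer> ngramCounts(String text, int n) {
        String[] words = text.toLowerCase().split("\\W+");
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= words.length; i++) {
            // Join words[i .. i+n-1] into one n-gram key.
            String gram = String.join(" ", Arrays.copyOfRange(words, i, i + n));
            counts.merge(gram, 1, Integer::sum);
        }
        return counts;
    }
}
```

Running this for n = 2 on an abstract containing "stem cell research and stem cell therapy" would, for example, surface "stem cell" as the top bigram.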

Aditya Mukherji
  • Could you tell me about the models in POS taggers? What are they? How do I train a POS tagger? Do I have to update the training from time to time? Where do I get the models? – de.la.ru May 22 '09 at 16:23
  • I used their POS tagger a few months back. You don't have to train anything; they provide default models which are pretty good. These models basically specify which words should be labelled with which parts of speech. You should start by downloading it and following the README instructions to get some sample output. I'm not sure, but I think the tags it uses are the 'word level' tags at http://bulba.sdsu.edu/jeanette/thesis/PennTags.html – Aditya Mukherji May 22 '09 at 19:37
  • Later on, you could train models on the kind of text you expect it to annotate, but don't think about that in the early stages, because it would be pretty tedious. You could call these libraries programmatically from your Java code (I'm not sure of the exact process), or just write a script that calls the tagger from the command line and stores its output in a file, which you then manipulate. A simple way to start would be to do that and then eliminate all closed-class tagged words from your list ( http://en.wikipedia.org/wiki/Closed_class_word ) – Aditya Mukherji May 22 '09 at 19:38
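The closed-class elimination suggested in the comments can be sketched as a tag filter; the tag set below is an illustrative subset of the Penn Treebank closed-class tags (determiners, prepositions, conjunctions, pronouns, modals, particles, wh-words), not an exhaustive list:

```java
import java.util.*;

public class ClosedClassFilter {
    // Illustrative subset of Penn Treebank closed-class tags.
    private static final Set<String> CLOSED_CLASS = new HashSet<>(Arrays.asList(
            "DT", "IN", "CC", "PRP", "PRP$", "MD", "RP", "TO", "WDT", "WP", "WP$"));

    // Keep only word_TAG tokens whose tag is NOT closed-class; return the words.
    public static List<String> openClassWords(String taggedText) {
        List<String> result = new ArrayList<>();
        for (String token : taggedText.split("\\s+")) {
            int sep = token.lastIndexOf('_');
            if (sep < 0) continue;                       // not a word_TAG token
            if (!CLOSED_CLASS.contains(token.substring(sep + 1))) {
                result.add(token.substring(0, sep));
            }
        }
        return result;
    }
}
```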

This might be relevant as well: https://github.com/jdf/cue.language

It has stop words, word and n-gram frequencies, and more.

It's part of the software behind Wordle.

Frank Shearar
fjen

There's an Apache project for that... I haven't used it, but OpenNLP is an open-source Apache project. It's in the incubator, so it may be a bit raw.

This post from Jeff's search engine cafe has a number of other suggestions.

Kevin Williams

I ended up using the Alias-i LingPipe.

de.la.ru