
I want to know if there is an API for text analysis in Java: something that can extract all the words in a text, separate words and expressions, and so on, and that can tell whether a word it finds is a number, date, year, name, currency, etc.

I'm just starting with text analysis, so I only need an API to kick off. I made a web crawler, and now I need something to analyze the downloaded data: methods to count the number of words in a page, find similar words, detect data types, and other text-related features.
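To illustrate the kind of counting I mean, here is a minimal word-frequency count in plain Java (assuming the page text has already been extracted from the HTML):

```java
import java.util.HashMap;
import java.util.Map;

public class WordCount {
    public static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        // Split on anything that is not a letter or digit.
        for (String token : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) continue;
            Integer n = counts.get(token);
            counts.put(token, n == null ? 1 : n + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
                countWords("The movie, the actors and the director.");
        System.out.println(counts.get("the")); // prints 3
    }
}
```

But this only counts exact tokens; what I'm missing is the part that understands what the tokens are.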

Are there APIs for text analysis in Java?

EDIT: Text mining. I want to mine the text, and I'm looking for a Java API that provides this.

Renato Dinhani
    There were some great answers on this thread http://stackoverflow.com/questions/3778388/java-text-analysis-libraries – crowne Jul 25 '11 at 19:23
  • I want to extract movie information from the downloaded pages. Things like title, actors, year, director, etc. – Renato Dinhani Jul 26 '11 at 21:23
  • @Renato Dinhani Conceição Do all of your downloaded pages have common html structure? (templated web pages?) – stemm Jul 29 '11 at 11:04
  • @stemm Yes, all of them are HTML. I'm avoiding other types. – Renato Dinhani Jul 29 '11 at 14:49
  • @Renato Dinhani Conceição I meant: do all of your pages have a templated HTML structure? To extract important information from text, you need to find the most informative parts of it. For example, if your pages have a templated structure, that would be simpler than coping with raw text. – stemm Jul 29 '11 at 15:30
  • @stemm Now I understand, and no, they are not structured. They are random pages from various sites that a robot crawls (being random is intentional). I will try to extract info from the HTML structure too, but they are not structured enough to rule out text mining. – Renato Dinhani Jul 29 '11 at 16:10
  • @Renato Dinhani Conceição look at the edit of my answer (some ideas about parser). – stemm Jul 30 '11 at 19:39

5 Answers


It looks like you're looking for a Named Entity Recogniser.

You have a couple of choices:

CRFClassifier, from the Stanford Natural Language Processing Group, is a Java implementation of a Named Entity Recogniser.

GATE (General Architecture for Text Engineering) is an open source suite for language processing. Take a look at the screenshots on the page for developers: http://gate.ac.uk/family/developer.html. They should give you a rough idea of what it can do, and the video tutorial gives a better overview of what the software has to offer.

You may need to customise one of them to fit your needs.

You also have other options:


In terms of training CRFClassifier, you can find a brief explanation in their FAQ:

...the training data should be in tab-separated columns, and you define the meaning of those columns via a map. One column should be called "answer" and has the NER class, and existing features know about names like "word" and "tag". You define the data file, the map, and what features to generate via a properties file. There is considerable documentation of what features different properties generate in the Javadoc of NERFeatureFactory, though ultimately you have to go to the source code to answer some questions...
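As an example of what such a properties file might look like (the file names are placeholders, and you should check the NERFeatureFactory Javadoc for the full set of feature flags):

```
trainFile = training-data.tsv
serializeTo = my-ner-model.ser.gz
map = word=0,answer=1
useClassFeature = true
useWord = true
useNGrams = true
maxNGramLeng = 6
usePrev = true
useNext = true
```

Here the map line says the first tab-separated column is the word and the second is the NER class, matching the FAQ description above.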

You can also find command-line examples in the Javadoc of CRFClassifier:

Typical command-line usage

For running a trained model with a provided serialized classifier on a text file:

java -mx500m edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier conll.ner.gz -textFile samplesentences.txt

When specifying all parameters in a properties file (train, test, or runtime):

java -mx1g edu.stanford.nlp.ie.crf.CRFClassifier -prop propFile

To train and test a simple NER model from the command line:

java -mx1000m edu.stanford.nlp.ie.crf.CRFClassifier -trainFile trainFile -testFile testFile -macro > output

William Niu

For example, you might use some classes from the standard library java.text, or use StreamTokenizer (which you can customize to your requirements). But as you know, text data from internet sources usually has many orthographical mistakes, so for better results you need something like a fuzzy tokenizer; java.text and the other standard utilities have too limited capabilities in that context.

So, I'd advise you to use regular expressions (java.util.regex) and create your own kind of tokenizer according to your needs.
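As a sketch of that approach (the token classes and patterns here are just an assumption about what you might need):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTokenizer {
    // Order matters: the first alternative that matches wins.
    private static final Pattern TOKEN = Pattern.compile(
            "(\\d{1,2}/\\d{1,2}/\\d{4})"   // group 1: DATE, e.g. 25/07/2011
          + "|(\\d+(?:\\.\\d+)?)"          // group 2: NUMBER
          + "|(\\p{L}+)");                 // group 3: WORD

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            if (m.group(1) != null)      tokens.add("DATE:" + m.group(1));
            else if (m.group(2) != null) tokens.add("NUMBER:" + m.group(2));
            else                         tokens.add("WORD:" + m.group(3));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Released 25/07/2011, rated 8.5"));
        // [WORD:Released, DATE:25/07/2011, WORD:rated, NUMBER:8.5]
    }
}
```

You would extend the alternation with patterns for currency, years, names and so on.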

P.S. Depending on your needs, you might create a state-machine parser for recognizing templated parts in raw text. You can see a simple state-machine recognizer in the picture below (and you can construct a more advanced parser that recognizes much more complex templates).

(figure: diagram of a simple state-machine recognizer)
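A minimal version of such a recognizer, hand-coded as a switch over states (the four-digit-year pattern is just an illustrative choice):

```java
public class YearRecognizer {
    // States of a tiny DFA that accepts exactly four digits (a year like "2011").
    private enum State { START, D1, D2, D3, ACCEPT, REJECT }

    public static boolean isYear(String token) {
        State state = State.START;
        for (char c : token.toCharArray()) {
            boolean digit = Character.isDigit(c);
            switch (state) {
                case START: state = digit ? State.D1 : State.REJECT; break;
                case D1:    state = digit ? State.D2 : State.REJECT; break;
                case D2:    state = digit ? State.D3 : State.REJECT; break;
                case D3:    state = digit ? State.ACCEPT : State.REJECT; break;
                default:    state = State.REJECT; break; // anything after 4 digits
            }
        }
        return state == State.ACCEPT;
    }

    public static void main(String[] args) {
        System.out.println(isYear("2011"));  // true
        System.out.println(isYear("20110")); // false
    }
}
```

The same structure scales to multi-token templates: each accepted token moves the parser to the next expected state.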

stemm

If you're dealing with large amounts of data, Apache Lucene may help with what you need.

Otherwise it might be easiest to just create your own Analyzer class that leans heavily on the standard Pattern class. That way, you can control what text is considered a word, boundary, number, date, etc. E.g., is 20110723 a date or number? You might need to implement a multiple-pass parsing algorithm to better "understand" the data.
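One way to resolve an ambiguity like 20110723 is a second pass that checks whether an eight-digit number also parses as a plausible yyyymmdd date (the rules and ranges here are a naive illustration, not a complete date validator):

```java
public class TokenClassifier {
    /** Classifies a token as DATE, NUMBER, or WORD using naive rules. */
    public static String classify(String token) {
        if (token.matches("\\d{8}")) {
            // Second pass: does the eight-digit number look like yyyymmdd?
            int month = Integer.parseInt(token.substring(4, 6));
            int day = Integer.parseInt(token.substring(6, 8));
            if (month >= 1 && month <= 12 && day >= 1 && day <= 31) {
                return "DATE";
            }
            return "NUMBER";
        }
        if (token.matches("\\d+")) return "NUMBER";
        return "WORD";
    }

    public static void main(String[] args) {
        System.out.println(classify("20110723")); // DATE  (07/23 is plausible)
        System.out.println(classify("20119999")); // NUMBER (month 99 is not)
    }
}
```

In a real analyzer you would combine such checks with context from surrounding tokens, which is exactly where the extra passes earn their keep.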

scott

I recommend looking at LingPipe too. If you are OK with web services, then this article has a good summary of the different APIs.

sumit

I'd adapt Lucene's Analyzer and stemmer classes rather than reinvent the wheel. They have the vast majority of cases covered. See also the additional and contrib classes.

Michael-O