Can the Stanford tokenizer (http://nlp.stanford.edu/software/tokenizer.shtml) print its output on a single line, the way the Apache OpenNLP command-line tool does (https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.tokenizer)?

giorgio79

1 Answer

You can use DocumentPreprocessor, either programmatically or from the command line.

From the CLI:

$ echo "This is a test. And some more." | java edu.stanford.nlp.process.DocumentPreprocessor 2>/dev/null
This is a test .
And some more .

You can do the same thing programmatically; see this SO answer.
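
The linked answer is not reproduced here. As a rough stand-in for the one-sentence-per-line output shape, here is a JDK-only sketch that uses `java.text.BreakIterator` instead of Stanford's `DocumentPreprocessor`. Unlike the CLI output above, it leaves each sentence untokenized. The class and method names (`SentencePerLine`, `splitSentences`) are hypothetical, and the JDK splitter may draw sentence boundaries differently from Stanford's.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper; uses only the JDK, not Stanford NLP.
class SentencePerLine {

    // Split text into sentences with the JDK's sentence BreakIterator.
    // Each sentence is trimmed but otherwise left untokenized.
    static List<String> splitSentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Print one sentence per line.
        for (String sentence : splitSentences("This is a test. And some more.")) {
            System.out.println(sentence);
        }
    }
}
```

If Stanford's own sentence boundaries matter, this sketch is not a substitute; it only illustrates the output format.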

Jon Gauthier
  • Thx Jon! I notice the output is tokenized, and I would like to avoid that. Any way to skip tokenization with Stanford NLP? – giorgio79 Feb 12 '15 at 18:48
  • Yes—use whitespace tokenization. Run `DocumentPreprocessor` with the `-help` option for details. – Jon Gauthier Feb 12 '15 at 19:24