
I am trying to write a simple program that counts words, occurrences of certain strings, and sentences. I have the word counter and the counter for certain strings working, but I cannot figure out how to count the sentences. If I count all the periods, what if there is more than one "."?

So far, this is my code:

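int count = 0;
// While there is another token in the file, keep reading and counting words
while (inputFile.hasNext()) {
    String token = inputFile.next();
    count++;
}

int letters = 0;
Scanner scanner = new Scanner(file);
// Test hasNext(), not hasNextLine(): next() reads tokens, and hasNextLine()
// can still be true when only trailing whitespace is left, which makes
// next() throw NoSuchElementException
while (scanner.hasNext()) {
    String nextToken = scanner.next();
    // Count each occurrence of the word "for", ignoring case
    if (nextToken.equalsIgnoreCase("for")) {
        letters++;
    }
}
scanner.close();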
  • Every time you encounter a dot, you test the characters to either side of it, or at least the character immediately following. If it's a space, it's likely to be a period, signaling the end of a sentence. If there's a numeral, then it's a decimal point, and part of a number. You're going to have to use some regular expression patterns, but nothing too heavy. You also have to decide whether and how you want to account for bad typists who might write a sentence.Like this... – MarsAtomic Oct 17 '14 at 22:56
  • What do you mean by "what if there is more than one ".""? Can you give an example of input that would cause this problem? (I know what _I_ think would cause a problem, but I wanted to better understand what you were trying to say.) – ajb Oct 17 '14 at 22:56
  • "Today, when I was in St. Louis, I met Mr. Paul Carlson, head of U.S. operations for the J. Crew company." OK, so you probably won't get everything right, but you'll need to come up with some idea of which dots you will treat as ending a sentence and which ones you won't. – ajb Oct 17 '14 at 23:00
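The heuristic from the comments above (look at what follows the dot) could be sketched roughly like this. It is only a minimal sketch: the class name `SentenceCounter`, the method `countSentences`, and the choice to also treat "!" and "?" as terminators are assumptions made for illustration, not part of the original code. It ignores decimal points and collapses "..." into one terminator, but it still miscounts abbreviations such as "St." or "Mr.".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
class SentenceCounter {
    // One or more of '.', '!' or '?' counts as a single terminator, but only
    // when followed by whitespace or the end of the text. A dot followed by a
    // digit (as in 3.50) therefore does not match.
    private static final Pattern SENTENCE_END = Pattern.compile("[.!?]+(?=\\s|$)");

    static int countSentences(String text) {
        Matcher matcher = SENTENCE_END.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "3.50" is skipped and "..." counts once, so this prints 3
        System.out.println(countSentences("It costs 3.50 today. Really... Yes!"));
    }
}
```

To use it with the Scanner code from the question, you would read the file contents into a single String first and pass that to `countSentences`.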

1 Answer


There are some answers here (Java simple sentence parser) using StringTokenizer, regex, BreakIterator, or whatever, but the real story is that identifying sentences is not a trivial task if you really want to find them all. Just think of a really long sentence that mixes quotes and numbers.

There are several libraries you can try, from Sentence Parser to more complex NLP toolkits such as LingPipe, Weka, and GATE (see http://www.quora.com/What-are-the-best-Java-open-source-NLP-toolkits).
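As one example of the BreakIterator route, here is a minimal sketch using `java.text.BreakIterator` from the standard library. The class name `BreakIteratorSentences`, the method name, and the use of `Locale.US` are assumptions for illustration. Its handling of abbreviations such as "St." depends on the JDK's built-in rules, so treat the result as an approximation.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper class, named here only for illustration.
class BreakIteratorSentences {
    static int countSentences(String text) {
        // Locale.US is an assumption; pick the locale that matches your text
        BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
        iterator.setText(text);

        int count = 0;
        int start = iterator.first();
        // Walk from boundary to boundary; each non-blank span is one sentence
        for (int end = iterator.next(); end != BreakIterator.DONE; end = iterator.next()) {
            if (!text.substring(start, end).trim().isEmpty()) {
                count++;
            }
            start = end;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countSentences("This is one. This is two! Is this three?"));
    }
}
```

This avoids writing the boundary rules yourself, at the cost of less control over which dots count as sentence ends.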

It all depends on how deep you want to go on this.

Leo