I assume you have a text file and not a complex document like MS-Word or RTF.
The concept of a paragraph in a text document is not well defined. In most cases a new paragraph is recognized by the fact that, when you open the document in a text editor, the next block of text starts on a new line.
Two special characters cause text to start on the next line: line feed (LF, '\n') and carriage return (CR, '\r'). Which one is used depends on the operating system. Sometimes a combination of both is used, as in CRLF ('\r\n').
In Java you can determine the character or characters used to separate lines/paragraphs with System.getProperty("line.separator"). But this brings in a new problem. What if you create a text file in MS Windows and then open it in Unix? The line separator in the file is then that of Windows, but Java is running on Unix.
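To make the mismatch concrete, here is a minimal sketch. The class name SeparatorMismatch and the helper splitOn are made up for illustration; the helper splits on one fixed separator, roughly the way code relying on the platform's line.separator would.

```java
import java.util.regex.Pattern;

public class SeparatorMismatch {
    // Hypothetical helper: splits text on exactly one separator string,
    // keeping trailing empty pieces (limit -1).
    static String[] splitOn(String text, String separator) {
        return text.split(Pattern.quote(separator), -1);
    }

    public static void main(String[] args) {
        String windowsText = "first\r\nsecond";
        String unixText = "first\nsecond";

        // Windows file split with the Unix separator: a stray '\r' stays attached.
        System.out.println(splitOn(windowsText, "\n")[0].endsWith("\r")); // true

        // Unix file split with the Windows separator: nothing is split at all.
        System.out.println(splitOn(unixText, "\r\n").length); // 1

        // Length of the separator on the JVM running this code (platform dependent).
        System.out.println(System.lineSeparator().length());
    }
}
```

In both cases the separator used for splitting comes from the machine running the code, not from the file, which is why the result is off.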
My recommendation is:
IF the length of the text (document) is zero, THEN paragraphs = 0.
IF the length of the text (document) is NOT zero, THEN
- Consider '\n' and '\r' as line break characters.
- Scan your text for these line break characters.
- Any run of consecutive line break characters, in any order, should be considered one paragraph break.
- Number of paragraphs = 1 + (count of paragraph breaks), not counting a break at the very start or end of the text.
Note that the exceptions pointed out by Stephen still apply here as well.
import java.util.StringTokenizer;

public class ParagraphTest {
    public static void main(String[] args) {
        // Sample text mixing LF, CR and combinations of both.
        String document =
            "Hello world.\n" +
            "This is line 2.\n\r" +
            "Line 3 here.\r" +
            "Yet another line 4.\n\r\n\r" +
            "Few more lines 5.\r";
        printParaCount(document);
    }

    public static void printParaCount(String document) {
        // Each character in this string is a delimiter; a run of
        // consecutive delimiters is treated as a single break.
        String lineBreakCharacters = "\r\n";
        StringTokenizer st = new StringTokenizer(
            document, lineBreakCharacters);
        System.out.println("ParaCount: " + st.countTokens());
    }
}
Output
ParaCount: 5
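For comparison, the steps in the recommendation can also be written out with a regular expression instead of StringTokenizer. This is only a sketch: the class name ParagraphCounter and the method countParagraphs are made up here, and it assumes that breaks at the very start or end of the text should not count.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParagraphCounter {
    // One or more consecutive CR/LF characters, in any order, form one break.
    private static final Pattern BREAK = Pattern.compile("[\\r\\n]+");

    static int countParagraphs(String document) {
        // Assumption: leading and trailing breaks do not separate paragraphs,
        // so strip them before counting.
        String trimmed = document.replaceAll("\\A[\\r\\n]+|[\\r\\n]+\\z", "");
        if (trimmed.isEmpty()) {
            return 0; // empty document (or only line breaks): no paragraphs
        }
        int breaks = 0;
        Matcher m = BREAK.matcher(trimmed);
        while (m.find()) {
            breaks++;
        }
        return 1 + breaks;
    }

    public static void main(String[] args) {
        String document = "Hello world.\nThis is line 2.\n\rLine 3 here.\r"
                + "Yet another line 4.\n\r\n\rFew more lines 5.\r";
        System.out.println("ParaCount: " + countParagraphs(document)); // ParaCount: 5
    }
}
```

Both versions give the same count for the sample text; the regex version just makes the "1 + breaks" rule explicit.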