4

I've got a regex "[\r\n\f]+" to find the number of lines contained in a String. My code is like this:

Pattern pattern = Pattern.compile("[\\r\\n\\f]+");
String[] lines = pattern.split(texts);

In my unit test I've got sample strings like these:

"\t\t\t    \r\n      \n"
"\r\n"

Parsing the first string gives 2 lines, but parsing the second string gives 0.

I thought the second string contains one line, even though that line is blank (suppose I'm editing a file that begins with "\r\n" in a text editor: shouldn't the caret be placed on the second line?). Is my regex incorrect for parsing lines, or am I missing something here?

Edit:

I think I'll make the question more obvious:

Why

// notice the trailing space in the string
"\r\n ".split("\r\n").length == 2 // results in 2 strings {"", " "}. So this block of text has two lines.

but

// notice there's no trailing space in the string 
"\r\n".split("\r\n").length == 0 // results in an empty array. Why "" (empty string) is not in the result and this block of text contains 0 lines?
Yu Lu

2 Answers

5

From the documentation for Pattern.split(CharSequence):

This method works as if by invoking the two-argument split method with the given input sequence and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array.

Many would agree that this behavior is confusingly inconsistent. You can disable the removal of trailing empty strings by passing a negative limit (all negative values do the same thing):

String[] lines = pattern.split(texts, -1);
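To make the difference concrete, here is a minimal sketch that splits the two sample strings from the question with a limit of 0 (the default) and a limit of -1. The class name `SplitLimitDemo` and the helper `count` are made up for this illustration.

```java
import java.util.regex.Pattern;

public class SplitLimitDemo {
    // Same pattern as in the question.
    private static final Pattern LINE_BREAKS = Pattern.compile("[\\r\\n\\f]+");

    // Hypothetical helper: number of pieces split() returns for a given limit.
    static int count(String text, int limit) {
        return LINE_BREAKS.split(text, limit).length;
    }

    public static void main(String[] args) {
        String first = "\t\t\t    \r\n      \n";
        String second = "\r\n";

        // limit 0: trailing empty strings are dropped
        System.out.println(count(first, 0));   // 2
        System.out.println(count(second, 0));  // 0

        // negative limit: trailing empty strings are kept
        System.out.println(count(first, -1));  // 3
        System.out.println(count(second, -1)); // 2
    }
}
```

With a negative limit each count is one more than the number of separator matches, because `split` reports the pieces around the separators, including the empty piece after the last one.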
Laurence Gonsalves
  • Oops! This seems to break everything. Splitting "\t\t\t \r\n \n" gives me 3 and "\r\n" gives me 2, which is even more confusing... – Yu Lu May 30 '14 at 21:53
  • The default behavior of throwing out trailing empty strings is copied from Perl's `split()`. Unfortunately, that's all that was copied. One feature I particularly miss is the ability to toss *all* empty strings. – Alan Moore May 31 '14 at 01:40
  • @Shunshun split deals with separators. That is, things that go *between* things. If you want terminators (pretty typical for line endings) then you need to ignore the last element if it's empty. If it isn't empty you need to decide whether that's an error, or not. Assuming you don't think it's an error, something like this should work: `numLines = lines.length; if (lines.length > 0 && lines[lines.length - 1].isEmpty()) numLines--;` – Laurence Gonsalves May 31 '14 at 06:43
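The terminator-style count from the last comment can be written out as a small method. This is only a sketch, assuming that every line break ends a line and that non-empty text after the last break counts as one more line. The class `LineCounter` and method `countLines` are hypothetical names.

```java
import java.util.regex.Pattern;

public class LineCounter {
    // Same pattern as in the question. The '+' merges consecutive breaks,
    // so blank lines in the middle of the text are not counted separately.
    private static final Pattern LINE_BREAKS = Pattern.compile("[\\r\\n\\f]+");

    static int countLines(String text) {
        // A negative limit keeps trailing empty strings.
        String[] parts = LINE_BREAKS.split(text, -1);
        int numLines = parts.length;
        // An empty last element means the text ended with a line break
        // (or was empty), so it is not a line of its own.
        if (parts.length > 0 && parts[parts.length - 1].isEmpty()) {
            numLines--;
        }
        return numLines;
    }

    public static void main(String[] args) {
        System.out.println(countLines("\t\t\t    \r\n      \n")); // 2
        System.out.println(countLines("\r\n"));                   // 1
        System.out.println(countLines("\r\n "));                  // 2
    }
}
```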
0

What counts as a line really depends on your environment. Quoting Wikipedia:

LF: Multics, Unix and Unix-like systems (GNU/Linux, OS X, FreeBSD, AIX, Xenix, etc.), BeOS, Amiga, RISC OS and others.

CR: Commodore 8-bit machines, Acorn BBC, ZX Spectrum, TRS-80, Apple II family, Mac OS up to version 9 and OS-9

RS: QNX pre-POSIX implementation.

0x9B: Atari 8-bit machines using ATASCII variant of ASCII (155 in decimal).

LF+CR: Acorn BBC and RISC OS spooled text output.

CR+LF: Microsoft Windows, DEC TOPS-10, RT-11 and most other early non-Unix and non-IBM OSes, CP/M, MP/M, DOS (MS-DOS, PC DOS, etc.), Atari TOS, OS/2, Symbian OS, Palm OS, Amstrad CPC

Perhaps you should try an arch neutral approach:

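    // needs: java.io.BufferedReader, java.io.StringReader, java.io.IOException
    String test = "\t\t\t    \r\n      \n";
    BufferedReader reader = new BufferedReader(new StringReader(test));
    int count = 0;
    String line;
    // readLine() returns null at end of input and may throw IOException
    while ((line = reader.readLine()) != null) {
        System.out.println(++count + ":" + line);
    }
    System.out.println("total lines == " + count);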

Edited to include Alan Moore's note about using .ready()

Andreas
  • This does give what I want (with a small modification: the while loop is an infinite loop). However, is this the only way to correctly split a block of text by lines without using any third-party library? And speaking of third-party libraries (e.g. Apache Commons), what regex do they use to split lines? – Yu Lu May 30 '14 at 22:13
  • I'm not sure this is the *only* correct way nor even if this is the most correct way, but it should be portable to all line endings. It looks like `readLine()` (per java 1.7 docs) is looking for `any one of a line feed ('\n'), a carriage return ('\r'), or a carriage return followed immediately by a linefeed`. You should be able to make a regex out of that. If, a few years from now, a new file format is in common use, your regex may no longer work. `BufferedReader.readLine()` (hopefully) still will. – Andreas May 30 '14 at 22:21
  • well, not sure if this answers my question or not. Please see my updated question above. – Yu Lu May 30 '14 at 22:43
  • **OTBI** (Off Topic But Important): That is not how the `ready()` method is meant to be used. Check [this question](http://stackoverflow.com/q/5244839/20938) for details. – Alan Moore May 31 '14 at 01:13
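The rule quoted from the `readLine()` docs in the comments above (a line feed, a carriage return, or a carriage return followed immediately by a line feed) can be turned into a regex. The following is a rough sketch, not what any particular library does. The class `ReadLineStyleCounter` and its method are hypothetical names, and the count treats line breaks as terminators, so an empty piece after the last break is not counted.

```java
import java.util.regex.Pattern;

public class ReadLineStyleCounter {
    // "\r\n" must come first in the alternation so that a CR+LF pair
    // is consumed as one terminator rather than two.
    private static final Pattern LINE_END = Pattern.compile("\\r\\n|\\r|\\n");

    static int countLines(String text) {
        // Negative limit: keep trailing empty strings so they can be inspected.
        String[] parts = LINE_END.split(text, -1);
        int n = parts.length;
        // Text that ends with a terminator (or is empty) leaves an empty
        // last piece, which is not a line.
        if (parts[n - 1].isEmpty()) {
            n--;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(countLines("\t\t\t    \r\n      \n")); // 2
        System.out.println(countLines("\r\n"));                   // 1
        System.out.println(countLines("a\n\nb"));                 // 3
    }
}
```

Unlike the `[\r\n\f]+` pattern from the question, this one has no `+`, so consecutive line breaks produce blank lines that are counted individually.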