
I'm running the exact same Eclipse project on Ubuntu and on Windows, but I get different output.

The inconsistent behavior occurs in the following code:

String regex = "<token id=\"(.*)\">.*\n.*<word>(.*)</word>.*\n.*<lemma>(.*)</lemma>.*\n.*\n.*\n.*<POS>(.*)</POS>";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(fileAsString);
while (matcher.find()) {
    ...
}

The matcher.find() check returns false on Windows but true on Ubuntu (which is the expected behavior).

Eclipse Juno and JDK 7 on both.

Maybe it's not related to the operating system, but that's the only difference I found after debugging both side by side and comparing the project's properties in the two environments.

Any idea what causes the difference?

3 Answers


You're matching \n, which is the line ending for Linux, but not Windows (you need \r\n for Windows). Something like \r?\n would fix your specific problem.
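
As a rough sketch of that fix, the snippet below runs the question's pattern and a \r?\n variant against the same text with Unix and with Windows line endings. The class name, the helper methods, and the sample input (including the two offset elements) are made up for illustration; only the regex comes from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineEndingDemo {
    // The pattern from the question, unchanged: it only accepts a bare \n between lines.
    static final String ORIGINAL =
        "<token id=\"(.*)\">.*\n.*<word>(.*)</word>.*\n.*<lemma>(.*)</lemma>.*\n.*\n.*\n.*<POS>(.*)</POS>";

    // Same pattern with every \n replaced by \r?\n, so both \n and \r\n are accepted.
    static final String PORTABLE = ORIGINAL.replace("\n", "\r?\n");

    // Counts how often the regex is found in the text.
    static int countMatches(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    // Hypothetical sample input; the two offset elements are invented for illustration.
    static String sample(String eol) {
        return "<token id=\"1\">" + eol
             + "  <word>cats</word>" + eol
             + "  <lemma>cat</lemma>" + eol
             + "  <CharacterOffsetBegin>0</CharacterOffsetBegin>" + eol
             + "  <CharacterOffsetEnd>4</CharacterOffsetEnd>" + eol
             + "  <POS>NNS</POS>" + eol
             + "</token>" + eol;
    }

    public static void main(String[] args) {
        System.out.println(countMatches(ORIGINAL, sample("\n")));    // 1
        System.out.println(countMatches(ORIGINAL, sample("\r\n")));  // 0
        System.out.println(countMatches(PORTABLE, sample("\n")));    // 1
        System.out.println(countMatches(PORTABLE, sample("\r\n")));  // 1
    }
}
```

The original pattern fails on the Windows text because .* stops in front of the \r and the following \n in the regex cannot match it.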

That said, you should never parse anything HTML-like (including XML) with regex. You're missing out on everything XML is about, not least its tolerance of hand-written variations such as a different order of tags, extra spaces, etc.
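
For comparison, here is a minimal sketch of reading the same fields with the JDK's built-in DOM parser instead of a regex. It assumes the tokens sit inside a single root element (called tokens here, which is a guess) and that every token has word, lemma and POS children; the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TokenXmlDemo {
    // Returns {id, word, lemma, POS} for every <token> element in the document.
    static List<String[]> readTokens(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList tokens = doc.getElementsByTagName("token");
            List<String[]> result = new ArrayList<String[]>();
            for (int i = 0; i < tokens.getLength(); i++) {
                Element token = (Element) tokens.item(i);
                result.add(new String[] {
                    token.getAttribute("id"),
                    childText(token, "word"),
                    childText(token, "lemma"),
                    childText(token, "POS")
                });
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Assumes the child element exists; a real program should handle a missing one.
    static String childText(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        // Windows line endings, reordered children, everything on odd lines: still parsed.
        String xml = "<tokens>\r\n<token id=\"1\"><lemma>cat</lemma>\r\n"
                   + "<word>cats</word> <POS>NNS</POS></token>\r\n</tokens>";
        for (String[] t : readTokens(xml)) {
            System.out.println(t[0] + " " + t[1] + " " + t[2] + " " + t[3]);
        }
    }
}
```

Because the parser works on elements rather than lines, neither the line-ending style nor the order of the child tags matters.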

Blindy

It might be a difference in end of line characters. Try adding an optional \r to the regex.

WW.

Very probably because of the line endings. By default the dot does not match line terminators (neither \n nor \r), and you explicitly look for \n in your regex, so the \r of a Windows \r\n is never consumed.

Try compiling your pattern with Pattern.DOTALL, or put \r?\n everywhere you have \n in the regex.
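
The two options behave differently, which the following small sketch tries to show on toy inputs (the class name, helpers and strings are invented for illustration). One thing to watch with Pattern.DOTALL: a greedy (.*) may then run across line breaks, so capture groups can pick up more than one line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllDemo {
    // True if the whole input matches the regex under the given flags.
    static boolean fullMatch(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    // Returns group 1 of the first match, or null if nothing is found.
    static String firstGroup(String regex, int flags, String input) {
        Matcher matcher = Pattern.compile(regex, flags).matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String windowsText = "a\r\nb";

        // Default flags: the dot skips \r, so a bare \n in the regex is not enough.
        System.out.println(fullMatch("a.*\nb", 0, windowsText));               // false
        // Optional \r in front of \n handles both conventions.
        System.out.println(fullMatch("a.*\r?\nb", 0, windowsText));            // true
        // DOTALL lets the dot consume the \r.
        System.out.println(fullMatch("a.*\nb", Pattern.DOTALL, windowsText));  // true

        // With DOTALL a greedy group can swallow several lines at once.
        String twoWords = "<word>a</word>\r\n<word>b</word>";
        System.out.println(firstGroup("<word>(.*)</word>", 0, twoWords));      // a
        System.out.println(firstGroup("<word>(.*)</word>", Pattern.DOTALL, twoWords)
                .equals("a</word>\r\n<word>b"));                               // true
    }
}
```

For a pattern with several capture groups like the one in the question, the \r?\n variant is therefore the less surprising of the two.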

fge