I have a Scala regex which does match with pattern.findFirstMatchIn() but not with a match ... case unpacking statement:

val pattern = "\"(\\d+?)\",\"(.*?)\",(.+)$".r
val line = "\"1795\",\"title\",\"desc"

println(pattern.findFirstMatchIn(line).isDefined)

val pattern2Unpacking = line match {
  case pattern( category_id, title, description) =>
    true
  case _ => false
}

println(pattern2Unpacking)

The line to match is "1795","title","desc (the lack of a trailing quote is intentional).

Output is true and false rather than both true.

I have looked at this answer and this but I cannot relate the solutions to my problem. Omitting the boundary matchers does not change anything.

What is going wrong here?

Update following comments

Spoiler: Part of the apparent weirdness reported below is explained by the fact that my data contains characters that the IDE does not display, which is something to watch out for in these situations. For an excellent in-depth explanation of what is actually happening, see the accepted answer.
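For anyone hitting something similar, here is a minimal sketch of one way to make such characters visible: render everything outside printable ASCII as a code point. The object name HiddenChars and the helper showHidden are hypothetical names made up for this illustration, and the sample string has the separator appended explicitly to mimic my data.

```scala
object HiddenChars {
  // Hypothetical helper: keep printable ASCII as-is and render every other
  // character as <U+XXXX>, so invisible characters show up in the output.
  def showHidden(s: String): String =
    s.map { c =>
      if (c >= ' ' && c <= '~') c.toString
      else "<U+%04X>".format(c.toInt)
    }.mkString

  def main(args: Array[String]): Unit = {
    // Sample line with a LINE SEPARATOR appended after "desc".
    val line = "\"1795\",\"title\",\"desc" + 0x2028.toChar
    println(showHidden(line)) // "1795","title","desc<U+2028>
  }
}
```

Running a suspicious line from the CSV through something like this before matching would have shown the stray character right away.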

Here is a screenshot from my IntelliJ. Code on top is copy paste from the link by @WiktorStribiżew. Code on the bottom is the one I based this post on. Output window is included in the screenshot. This is not a prank, and I find this a little scary.

Update 2

This is even better: http://ideone.com/KsIIc1

No, I neither fake screenshots nor did I hack ideone.com to play a prank.

DCS
  • I find a line like `val line = """"356","789","Title","bla"""` to be the cause of the `false`, because you need to escape the quotes. –  Mar 22 '16 at 13:41
  • Damn, with escaped quotes it works. The problem is: The line with the quotes is just my toy example in order to post a complete code. My real problem is with a line coming from a CSV and a more complicated regex, which I dumbed down for readability. – DCS Mar 22 '16 at 13:46
  • Still, why does the original line yield `true` on `pattern.findFirstMatchIn`? – DCS Mar 22 '16 at 13:46
  • I'm not sure - but notice that it doesn't really match what you want it to match - if you print the result of `pattern.findFirstMatchIn(line)`, it prints `Some("356","789","Title","bla)` (notice the quotes are included!) - perhaps that's a lead on what's going wrong here. – Tzach Zohar Mar 22 '16 at 13:52
  • Because the match is found like this: `"356"`; the rest of the preceding quotes are not matched. –  Mar 22 '16 at 13:55
  • Ok, back to square 1. I'm posting a properly escaped version of the problem that still fails in a sec. – DCS Mar 22 '16 at 13:55
  • @DCS: Can't you use `'` instead of `"`, like this: '"1795","title","desc' ? In PHP it's a convention. Might be in Scala too. –  Mar 22 '16 at 14:19
  • @noob In scala, `'` is a char. Strings need double quotes. Also, the strings to match actually come from a (16 GB) CSV file which I cannot change. – DCS Mar 22 '16 at 14:25
  • I ran the updated code and I get `true` for both outputs. Same for both REPL and Intellij IDE. – jwvh Mar 22 '16 at 16:19
  • http://ideone.com/wvuZgw – Wiktor Stribiżew Mar 22 '16 at 16:31
  • Ok, this is getting a bit scary. Posting a screenshot from my IntelliJ - never thought I would have to use screenshots to prove that I am not a liar. Above code is copy paste from @WiktorStribiżew link. Second code is the one I based this post on. Same code (to me), different result. – DCS Mar 22 '16 at 20:24
  • @WiktorStribiżew Please see my ideone link. Copy-paste from my IntelliJ. – DCS Mar 22 '16 at 20:37
  • I would appreciate people not down voting or asking to close a question just because it documents behavior that is not trivial to reproduce. I do my best to prove my point is an actual problem. – DCS Mar 22 '16 at 20:40
  • @DCS: In your Edit 2, there is a U+2028 (LINE SEPARATOR) right after "desc". If you remove it, the IDEONE demo you posted will return 2 trues. Just as an experiment: add `(?s)` at the start of your pattern and retry. – Wiktor Stribiżew Mar 22 '16 at 20:42
  • @WiktorStribiżew That's it! This is what I now think happened: I made the code sample by editing an original long troublesome string from my CSV data down to something smaller for posting. Despite the editing, the character seems to have survived, and it does not show up in IntelliJ. – DCS Mar 22 '16 at 20:50
  • So, do you want to keep the question and have me post an answer? Is the question still valid? – Wiktor Stribiżew Mar 22 '16 at 20:51
  • Given how crazy the problem looks when you encounter it for the first time, I think an answer might help somebody who encounters something similar. It basically cost me an afternoon. My view is that an answer helps future SO users, but if you disagree I'm happy to delete the post. – DCS Mar 22 '16 at 20:55
  • I think the real reason is that the `$` in the 2 cases was treated differently. In the first case, it was *the end of the string, or right before the final newline*, and in the second case, it was treated as a `\z`, *the very end of the string*. – Wiktor Stribiżew Mar 22 '16 at 21:00
  • Well, I think this question is valid after delving a bit deeper. I will post an answer. – Wiktor Stribiżew Mar 22 '16 at 21:16

1 Answer

The difference lies in how the two calls use the regex: findFirstMatchIn applies the pattern unanchored, searching for a match anywhere in the input, while match ... case uses the anchored version, which has to match the whole input. If you read the reference, you will see that the findFirst methods do not use anchored patterns.
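A minimal sketch of that difference, using a hypothetical digits-only pattern rather than the one from the question (the object and method names are made up for illustration):

```scala
object AnchoringDemo {
  // Hypothetical pattern for illustration only: one or more digits, captured.
  val digits = "(\\d+)".r

  // findFirstMatchIn searches for the pattern anywhere in the input.
  def foundAnywhere(s: String): Boolean =
    digits.findFirstMatchIn(s).isDefined

  // The extractor behind match ... case has to match the entire input.
  def matchesWhole(s: String): Boolean = s match {
    case digits(_) => true
    case _         => false
  }

  def main(args: Array[String]): Unit = {
    println(foundAnywhere("id 1795 here")) // true: digits occur inside the string
    println(matchesWhole("id 1795 here"))  // false: text surrounds the digits
    println(matchesWhole("1795"))          // true: the whole input is digits
  }
}
```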

Now, why did findFirstMatchIn find a match if your original string contains a U+2028 (LINE SEPARATOR) character at the end and the $ anchor sits at the end of the pattern? Because $ can match at the very end of the string or right before a final line terminator at the end of the string (the same as the \Z anchor), and U+2028 counts as a line terminator:

The end of the input but for the final terminator, if any

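When Scala anchors the regex, it seems to use \z logic and only matches at the very end of the string (although the reference says it uses ^ and $). Since . does not match line terminators by default, (.+) cannot consume the trailing U+2028, so the anchored match fails.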
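The anchor behaviour can be checked in isolation. The sketch below uses a shortened, made-up input ("desc" followed by U+2028) and a hypothetical helper found; only the three anchors differ between the calls.

```scala
object DollarDemo {
  // Shortened sample input: "desc" followed by a LINE SEPARATOR (U+2028).
  val input: String = "desc" + 0x2028.toChar

  // Hypothetical helper: is there an unanchored match of the regex in the input?
  def found(regex: String): Boolean =
    regex.r.findFirstIn(input).isDefined

  def main(args: Array[String]): Unit = {
    println(found("desc$"))   // true: $ matches before the final line terminator
    println(found("desc\\Z")) // true: \Z behaves the same way
    println(found("desc\\z")) // false: \z only matches at the very end
  }
}
```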

So, there are two ways to fix the issue:

  • Either add a (?s) DOTALL modifier to the beginning of your pattern (demo):
    val pattern = "(?s)\"(\\d+?)\",\"(.*?)\",(.+)$".r
  • Or use .unanchored when defining the regex pattern (demo):
    val pattern = "\"(\\d+?)\",\"(.*?)\",(.+)$".r.unanchored
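Both fixes can be tried side by side. This is a sketch under the assumption that your line ends in U+2028, which is appended explicitly here so it is visible in the source; the object name FixDemo and the helper description are hypothetical.

```scala
import scala.util.matching.Regex

object FixDemo {
  // The question's line with the invisible LINE SEPARATOR made explicit.
  val line: String = "\"1795\",\"title\",\"desc" + 0x2028.toChar

  val original: Regex   = "\"(\\d+?)\",\"(.*?)\",(.+)$".r
  val dotAll: Regex     = "(?s)\"(\\d+?)\",\"(.*?)\",(.+)$".r
  val unanchored: Regex = "\"(\\d+?)\",\"(.*?)\",(.+)$".r.unanchored

  // Hypothetical helper: the third capture group if the extractor succeeds.
  def description(p: Regex): Option[String] = line match {
    case p(_, _, d) => Some(d)
    case _          => None
  }

  def main(args: Array[String]): Unit = {
    println(description(original).isDefined)   // false: anchored, . stops at U+2028
    println(description(dotAll).isDefined)     // true: . now also matches U+2028
    println(description(unanchored).isDefined) // true: behaves like findFirstMatchIn
    println(description(dotAll).map(_.length))     // Some(6): separator is captured
    println(description(unanchored).map(_.length)) // Some(5): separator is left out
  }
}
```

One difference to keep in mind when choosing: with (?s) the last group includes the trailing separator, while with .unanchored it does not, so you may want to trim the captured value in the first case.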
Wiktor Stribiżew