I have a strange situation with a regex Matcher that I find difficult to understand.

When I pass the input parameter issueBody to the matcher, matcher.find() always returns false, whereas passing a hard-coded String with the same value as issueBody works as expected.

The regex function:

private Map<String, String> extractCodeSnippet(Set<String> resolvedIssueCodeLines, String issueBody) {
    String codeSnippetForCodeLinePattern = "\\(Line #%s\\).*\\W\\`{3}\\W+(.*)(?=\\W+\\`{3})";
    Map<String, String> resolvedIssuesMap = new HashMap<>();

    for (String currentResolvedIssue : resolvedIssueCodeLines) {
        String currentCodeLinePattern = String.format(codeSnippetForCodeLinePattern, currentResolvedIssue);

        Pattern pattern = Pattern.compile(currentCodeLinePattern, Pattern.MULTILINE);
        Matcher matcher = pattern.matcher(issueBody);

        while (matcher.find()) {
            resolvedIssuesMap.put(currentResolvedIssue, matcher.group());
        }
    }
    return resolvedIssuesMap;
}

With the following, matcher.find() always returns false:

Pattern pattern = Pattern.compile(currentCodeLinePattern, Pattern.MULTILINE);
Matcher matcher = pattern.matcher(issueBody);

While with the following, matcher.find() always returns true:

Pattern pattern = Pattern.compile(currentCodeLinePattern, Pattern.MULTILINE);
Matcher matcher = pattern.matcher("**SQL_Injection** issue exists @ **VB_3845_112_lines/encode.frm** in branch **master**\n" +
                        "\n" +
                        "Severity: High\n" +
                        "\n" +
                        "CWE:89\n" +
                        "\n" +
                        "[Vulnerability details and guidance](https://cwe.mitre.org/data/definitions/89.html)\n" +
                        "\n" +
                        "[Internal Guidance](https://checkmarx.atlassian.net/wiki/spaces/AS/pages/79462432/Remediation+Guidance)\n" +
                        "\n" +
                        "[ppttx](http://WIN2K12-TEMP/bbcl/ViewerMain.aspx?planid=1010013&projectid=10005&pathid=1)\n" +
                        "\n" +
                        "Lines: 41 42 \n" +
                        "\n" +
                        "---\n" +
                        "[Code (Line #41):](null#L41)\n" +
                        "```\n" +
                        "    user_name = txtUserName.Text\n" +
                        "```\n" +
                        "---\n" +
                        "[Code (Line #42):](null#L42)\n" +
                        "```\n" +
                        "    password = txtPassword.Text\n" +
                        "```\n" +
                        "---\n");

My question is: why? What is the difference between the two statements?

VLAZ
nimrod
  • Can you also show where/how you are assigning the issueBody string? Maybe something is wrong with that. – rhowell Mar 11 '20 at 18:39
  • My first guess is that issueBody was created from a byte stream using the wrong charset. – VGR Mar 11 '20 at 18:56
  • What is the value of currentCodeLinePattern – midhun mathew Mar 11 '20 at 18:57
  • *"Why?"* Because `issueBody` is **not** the same value as that string literal, regardless of your unproven claim that it is. To verify that, print the bytes of both: `System.out.println(Arrays.toString(issueBody.getBytes(StandardCharsets.UTF_8)));` and the same for the string literal, then compare the two. – Andreas Mar 11 '20 at 18:58
  • I added the whole function. **issueBody** is nothing but a string that is related to an issue object... – nimrod Mar 11 '20 at 19:08
  • Try ``String codeSnippetForCodeLinePattern = "(?d)\\(Line #%s\\).*\\W`{3}\\W+(.*)(?=\\W+`{3})";`` – Wiktor Stribiżew Mar 11 '20 at 20:32
  • @WiktorStribiżew - it doesn't work; the initial '(?d)' is marked as an incomplete group structure – nimrod Mar 11 '20 at 20:41
  • @Andreas - you are right, they are not the same. Why is that, and what needs to be done? – nimrod Mar 11 '20 at 20:42
  • Hm, and `Pattern pattern = Pattern.compile(currentCodeLinePattern, Pattern.UNIX_LINES);`? Note `Pattern.MULTILINE` is not necessary as your pattern has neither `^` nor `$`. – Wiktor Stribiżew Mar 11 '20 at 20:43
  • @WiktorStribiżew - it worked! I guess I never would have thought that `Pattern.MULTILINE` was the problem... thanks! – nimrod Mar 11 '20 at 20:58
  • If my answer did not solve your problem please consider updating the question. – Wiktor Stribiżew Nov 28 '21 at 19:17

1 Answer

TL;DR:

With Pattern.UNIX_LINES, the Java regex engine treats only the newline, LF (\n), as a line terminator, so . matches any char except LF (including CR). Use

Pattern pattern = Pattern.compile(currentCodeLinePattern, Pattern.UNIX_LINES);

In your hard-coded string, all line endings are plain newlines (LF), while your issueBody most likely contains \r\n (CRLF) endings. Your pattern matches only a single non-word char with \W before the backticks (see the \\W\\`{3} pattern part), but CRLF consists of two non-word chars. By default, . does not match line break chars, so it matches neither \r (CR) nor \n (LF). The \(Line #%s\).*\W\`{3} part fails precisely because of this:

  • \(Line #%s\) - matches (Line #<NUMBER>)
  • .* - matches 0 or more chars other than line break chars (so it stops before a CR or an LF)
  • \W - matches a single char other than a letter/digit/_ (here, either \r or \n, but not both)
  • \`{3} - 3 backticks - these are only matched if the line ended with \n alone, not with \r\n (CRLF).

With Pattern.UNIX_LINES, . also matches \r, so .* consumes the CR, \W matches the remaining LF, and the three backticks can then be matched.
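
The CRLF explanation above can be checked with a small sketch. It reuses the pattern from the question, but the class name LineEndingDemo, the helper firstSnippet, and the shortened sample body are made up for illustration, and the CRLF variant is only an assumption about what issueBody really contains.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineEndingDemo {
    // Pattern template copied from the question; %s is the line number.
    static final String TEMPLATE = "\\(Line #%s\\).*\\W\\`{3}\\W+(.*)(?=\\W+\\`{3})";

    // Hypothetical helper: returns capture group 1 of the first match,
    // or null when the pattern does not match at all.
    static String firstSnippet(String body, String line, int flags) {
        Pattern pattern = Pattern.compile(String.format(TEMPLATE, line), flags);
        Matcher matcher = pattern.matcher(body);
        return matcher.find() ? matcher.group(1) : null;
    }

    // Makes a carriage return visible when printing.
    static String show(String s) {
        return String.valueOf(s).replace("\r", "\\r");
    }

    public static void main(String[] args) {
        // Shortened sample taken from the hard-coded string (LF endings only).
        String lf = "[Code (Line #41):](null#L41)\n```\n    user_name = txtUserName.Text\n```\n---\n";
        // Assumed shape of the real issueBody: same text with CRLF endings.
        String crlf = lf.replace("\n", "\r\n");

        // LF input, default flags: the snippet is found.
        System.out.println(show(firstSnippet(lf, "41", 0)));
        // CRLF input, default flags: prints null, because \W consumes \r
        // and the next char is \n, not a backtick.
        System.out.println(show(firstSnippet(crlf, "41", 0)));
        // CRLF input, UNIX_LINES: found, since . now matches \r as well.
        System.out.println(show(firstSnippet(crlf, "41", Pattern.UNIX_LINES)));
    }
}
```

Note that under UNIX_LINES the capture group keeps a trailing \r, because the greedy (.*) is now allowed to consume the CR; trimming the result may be needed if the snippet is used further.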

BTW, Pattern.MULTILINE only makes ^ match at the start of each line and $ match at the end of each line. Since there is neither ^ nor $ in your pattern, you may safely discard this option.
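
To illustrate the point about Pattern.MULTILINE, here is a minimal sketch; the class MultilineDemo, the helper countMatches, and the sample text are hypothetical and not part of the original code. It counts matches for a pattern with ^ and for one without anchors, with and without the flag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineDemo {
    // Hypothetical helper: counts how many times the regex matches in the input.
    static int countMatches(String regex, int flags, String input) {
        Matcher matcher = Pattern.compile(regex, flags).matcher(input);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up sample text with three lines.
        String text = "Lines: 41 42\n---\nLine #41\n";

        // With an anchor, the flag changes where ^ may match.
        System.out.println(countMatches("^L", 0, text));                 // 1: start of input only
        System.out.println(countMatches("^L", Pattern.MULTILINE, text)); // 2: start of each line

        // Without anchors, the flag makes no difference.
        System.out.println(countMatches("Line", 0, text));                 // 2
        System.out.println(countMatches("Line", Pattern.MULTILINE, text)); // 2
    }
}
```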

Wiktor Stribiżew