Regex with error checking

Question

I've done a bunch of searching but I'm terrible with regex statements and my google-fu in this instance as not been strong.

Scenario:

In push notifications, we're passed a URL that contains a 9-digit content ID.

Example URL: http://www.something.com/foo/bar/Some-title-Goes-here-123456789.html (123456789 is the content ID in this scenario)

Current regex to parse the content ID:

public String getContentIdFromPathAndQueryString(String path, String queryString) {
        String contentId = null;
        if (StringUtils.isNonEmpty(path)) {
            Pattern p = Pattern.compile("([\\d]{9})(?=.html)");
            Matcher m = p.matcher(path);
            if (m.find()) {
                contentId = m.group();
            } else if (StringUtils.isNonEmpty(queryString)) {
                p = Pattern.compile("(?:contentId=)([\\d]{9})(?=.html)");
                m = p.matcher(queryString);
                if (m.find()) {
                    contentId = m.group();
                }
            }
        }

        Log.d(LOG_TAG, "Content id " + (contentId == null ? "not found" : (" found - " + contentId)));
        if (StringUtils.isEmpty(contentId)) {
            Answers.getInstance().logCustom(new CustomEvent("eid_url")
                    .putCustomAttribute("contentId", "empty")
                    .putCustomAttribute("path", path)
                    .putCustomAttribute("query", queryString));
        }

        return contentId;
    }

The problem: This does the job but there's a specific error scenario that I need to account for.

Whoever creates the push may put in the wrong length content ID and we need to grab it regardless of that, so assume it can be any number of digits... the title can also contain digits, which is annoying. The content ID will ALWAYS be followed by ".html"

would that bypass any numbers in the title portion before the content ID? — Psest328, Jul 27 '17 at 19:59
great tool! thank you. To confirm, the regex for java in android should be: Pattern p = Pattern.compile(([\d]+)(?=.html)"); correct? — Psest328, Jul 27 '17 at 20:03
Yes, you need no capturing group if you want to use `.group()`, the whole match value. If you need just digits in the second, use `contentId=(\\d+)\\.html"` and access the number with `.group(1)`. — Wiktor Stribiżew, Jul 27 '17 at 20:04
wow.. that was super quick and easy. thank you so much. I guess find a way to wrap that neatly into an answer and you'll get the check mark, my friend :) — Psest328, Jul 27 '17 at 20:06

score 1 · Accepted Answer · answered Jul 27 '17 at 20:13

While the basic answer here would be just "replace {9} limiting quantifier matching exactly 9 occurrences with a + quantifier matching 1+ occurrences", there are two patterns that can be improved.

The unescaped dot should be escaped in the pattern to match a literal dot.

If you have no overlapping matches, no need to use a positive lookahead with a capturing group before it, just keep the capturing group and grab .group(1) value.

A non-capturing group (?:...) is still a consuming pattern, and the (?:contentId=) equals contentId= (you may remove (?: and )).

There is no need wrapping a single atom within a character class, use \\d instead of [\\d]. That [\\d] is actually a source of misunderstandings, some may think it is a grouping construct, and might try adding alternative sequences into the square brackets, while [...] matches a single char.

So, your code can look like

        Pattern p = Pattern.compile("(\\d+)\\.html");     // No lookahead, + instead of {9}
        Matcher m = p.matcher(path);
        if (m.find()) {
            contentId = m.group(1);                       // (1) refers to Group 1
        } else if (StringUtils.isNonEmpty(queryString)) {
            p = Pattern.compile("contentId=(\\d+)\\.html");
            m = p.matcher(queryString);
            if (m.find()) {
                contentId = m.group(1);
            }
        }

Regex with error checking

1 Answers1