1

I've done a bunch of searching but I'm terrible with regex statements and my google-fu in this instance as not been strong.

Scenario:

In push notifications, we're passed a URL that contains a 9-digit content ID.

Example URL: http://www.something.com/foo/bar/Some-title-Goes-here-123456789.html (123456789 is the content ID in this scenario)

Current regex to parse the content ID:

public String getContentIdFromPathAndQueryString(String path, String queryString) {
        String contentId = null;
        if (StringUtils.isNonEmpty(path)) {
            Pattern p = Pattern.compile("([\\d]{9})(?=.html)");
            Matcher m = p.matcher(path);
            if (m.find()) {
                contentId = m.group();
            } else if (StringUtils.isNonEmpty(queryString)) {
                p = Pattern.compile("(?:contentId=)([\\d]{9})(?=.html)");
                m = p.matcher(queryString);
                if (m.find()) {
                    contentId = m.group();
                }
            }
        }

        Log.d(LOG_TAG, "Content id " + (contentId == null ? "not found" : (" found - " + contentId)));
        if (StringUtils.isEmpty(contentId)) {
            Answers.getInstance().logCustom(new CustomEvent("eid_url")
                    .putCustomAttribute("contentId", "empty")
                    .putCustomAttribute("path", path)
                    .putCustomAttribute("query", queryString));
        }

        return contentId;
    }

The problem: This does the job but there's a specific error scenario that I need to account for.

Whoever creates the push may put in the wrong length content ID and we need to grab it regardless of that, so assume it can be any number of digits... the title can also contain digits, which is annoying. The content ID will ALWAYS be followed by ".html"

Psest328
  • 6,575
  • 11
  • 55
  • 90

1 Answers1

1

While the basic answer here would be just "replace {9} limiting quantifier matching exactly 9 occurrences with a + quantifier matching 1+ occurrences", there are two patterns that can be improved.

The unescaped dot should be escaped in the pattern to match a literal dot.

If you have no overlapping matches, no need to use a positive lookahead with a capturing group before it, just keep the capturing group and grab .group(1) value.

A non-capturing group (?:...) is still a consuming pattern, and the (?:contentId=) equals contentId= (you may remove (?: and )).

There is no need wrapping a single atom within a character class, use \\d instead of [\\d]. That [\\d] is actually a source of misunderstandings, some may think it is a grouping construct, and might try adding alternative sequences into the square brackets, while [...] matches a single char.

So, your code can look like

        Pattern p = Pattern.compile("(\\d+)\\.html");     // No lookahead, + instead of {9}
        Matcher m = p.matcher(path);
        if (m.find()) {
            contentId = m.group(1);                       // (1) refers to Group 1
        } else if (StringUtils.isNonEmpty(queryString)) {
            p = Pattern.compile("contentId=(\\d+)\\.html");
            m = p.matcher(queryString);
            if (m.find()) {
                contentId = m.group(1);
            }
        }
Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563