Can I combine a regex with a substring() in Java?

Question

We're in the process of importing our documentation library to SharePoint, and I'm using a Java program I wrote to build metadata for these documents. One of the things that I need to do is determine whether a document has a cross-referenced document. This condition is defined as having the phrase "see " in the document name. However, naming conventions are nonexistent, and all of the following variations exist:

document_see_other_document.doc
document_-_see_other_document.doc
document_(see_other_document).doc
document_[see_other_document].doc
document_{see_other_document}.doc

I have created a variable which defaults as such: String xref = "no cross reference"; I would like to set this String to "see_other_document" in cases where there is a see <other document> substring in the filename.

My plan is to look for an instance of see_, use that as the start point of a substring, ending with the ., non-inclusive. But I want to ELIMINATE any special characters that may exist. In my cases above, I would like to return five instances of other_document, not other_document), etc.

My thought was to pull the substring into a variable, then use a regex [^a-zA-Z0-9] and replace non-alphanumeric characters in that second string variable, but is there a better, more elegant way to skin this cat?

PSEUDOCODE:

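if (filename.indexOf("see_") > -1) {
    // take everything after "see_" up to, but not including, the last '.'
    String tempFilename = filename.substring(filename.indexOf("see_") + 4, filename.lastIndexOf("."));
    xref = tempFilename.replaceAll("[^a-zA-Z0-9]", "");
}
// otherwise xref keeps its default value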

Yes, you can use regex capturing groups. See [this question][1] for how to do it. [1]: http://stackoverflow.com/questions/1277157/java-regex-replace-with-capturing-group — Robin Green, Dec 29 '13 at 16:20

JosefN · Answer 1 · 2013-12-29T16:35:04.403

You can use a regex with optional parts. The following fragment shows how. (?:something) is a non-capturing group:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Optional "-_" separator and optional bracket around "see_<name>"; group 1 captures the name
    Pattern patt = Pattern.compile("_(?:\\-_)?(?:\\(|\\[|\\{)?see_([a-zA-Z_0-9]+)(?:\\)|\\}|\\])?");

    for (String filename : new String[] {"document_see_other_document.doc", "document_-_see_other_document2.doc",
            "document_(see_other_3document).doc", "document_[see_other_4document].doc", "document_{see_other_document5}.doc", "blacksee_other_document.doc"}) {
        Matcher m = patt.matcher(filename);

        if (m.find()) {
            System.out.println(m.group(1));   // the cross-referenced name
        } else {
            System.out.println("negative");   // no "_see_" reference in this name
        }
    }

You should use character classes to match those bracketing characters, i.e. `[(\\[{]?` and `[}\\])]?`. Your way works okay, but look how much easier it is to read this way. — Alan Moore, Dec 29 '13 at 16:43
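The character-class variant suggested in the comment above could look roughly like the sketch below. It assumes the same optional `-_` separator and the same capture group as the pattern in the answer. The class name `XrefCharClass`, the method name `extractXref`, and the "no cross reference" fallback (borrowed from the question) are illustrative choices, not part of either author's code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XrefCharClass {
    // "_" , optional "-_", optional opening bracket, "see_", the captured name,
    // then an optional closing bracket. Brackets are written as character classes.
    private static final Pattern XREF =
            Pattern.compile("_(?:-_)?[(\\[{]?see_([a-zA-Z_0-9]+)[)\\]}]?");

    // Hypothetical helper: returns the referenced name, or the question's default string.
    static String extractXref(String filename) {
        Matcher m = XREF.matcher(filename);
        return m.find() ? m.group(1) : "no cross reference";
    }

    public static void main(String[] args) {
        String[] names = {
            "document_see_other_document.doc",
            "document_-_see_other_document.doc",
            "document_(see_other_document).doc",
            "document_[see_other_document].doc",
            "document_{see_other_document}.doc",
            "blacksee_other_document.doc"
        };
        for (String name : names) {
            System.out.println(name + " -> " + extractXref(name));
        }
    }
}
```

With the five filenames from the question, each call should yield `other_document`. The `blacksee_` name falls through to the default because the pattern requires an underscore directly before the optional separator and `see_`.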

score 0 · Answer 2 · answered Dec 29 '13 at 16:27

As Steve McConnell suggests, writing something in one line does not make it more elegant. I believe that your way of doing things is the most elegant one.

Let's suppose that you find a magical way of using a complex regular expression doing all these things in one line.

Would the code be more readable then? Certainly not. Complex regular expressions are far from easy to read. Few people would understand what you want to do just by reading a regular expression.

Would the code be more maintainable? Certainly not. Changing a regular expression to do a slightly different match could be a very tedious task. The same goes for debugging.

Would the code be faster? Maybe yes, maybe no. You would have to test it. Nevertheless, the performance difference is not your goal.

Therefore, I suppose your code is elegant enough and I would not suggest changing it.

Hope I helped!

Alan Moore · Answer 3 · 2013-12-29T17:02:24.950

In all of your exemplars, the junk characters occur just before and just after the message (see_other_document). And the message itself consists entirely of word characters (i.e., no punctuation and no whitespace). Can we count on all those conditions? If we can, this should put you right:

    String result = source.replaceAll(
         "(document_)[\\W_]*+(see_\\w++)[^\\w.]*+(\\.doc)",
         "$1$2$3");

The basic idea is if you don't want it, don't capture it.
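To see what that call does to the filenames from the question, here is a small runnable sketch. The pattern and replacement are copied from the snippet above. The class name `NormalizeXrefName` and the method name `normalize` are made up for the example, and the literal `document_` prefix is only meaningful for these sample names.

```java
public class NormalizeXrefName {
    // Keeps the three captured pieces and drops whatever junk sits between them.
    static String normalize(String source) {
        return source.replaceAll(
                "(document_)[\\W_]*+(see_\\w++)[^\\w.]*+(\\.doc)",
                "$1$2$3");
    }

    public static void main(String[] args) {
        String[] names = {
            "document_see_other_document.doc",
            "document_-_see_other_document.doc",
            "document_(see_other_document).doc",
            "document_[see_other_document].doc",
            "document_{see_other_document}.doc"
        };
        for (String name : names) {
            System.out.println(name + " -> " + normalize(name));
        }
    }
}
```

All five sample names should come out as `document_see_other_document.doc`. A name that does not match the pattern is returned unchanged, because `replaceAll` only rewrites matches, so a caller that needs to know whether a cross reference exists would still have to test for a match separately.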

score 0 · Answer 4 · answered Dec 29 '13 at 16:44

Your code is actually fine, but you can try this:

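if (filename.indexOf("see_") >= 0) {
    // cut the extension by length (assumes a 3-letter one such as ".doc"), so a '.' inside "other_document" is harmless
    String temp = filename.substring(filename.indexOf("see_") + 4, filename.length() - 4);
    // \p{L} matches any Unicode letter, not just a-z and A-Z
    xref = temp.replaceAll("[^\\p{L}0-9]", "");
}
// otherwise xref keeps its default value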
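For a version of the substring-then-regex approach that compiles on its own, one possible sketch follows. It makes two assumptions that go beyond this answer: the underscore is added to the kept characters so that `other_document` survives the cleanup (the character class as posted would also strip it), and the extension is located with `lastIndexOf('.')` rather than a fixed length. The names `SubstringXref` and `extractXref` are hypothetical.

```java
public class SubstringXref {
    // Hypothetical helper combining substring() with a regex cleanup.
    static String extractXref(String filename) {
        String xref = "no cross reference";       // default from the question
        int start = filename.indexOf("see_");
        int dot = filename.lastIndexOf('.');
        // Only proceed when "see_" exists and some text sits between it and the extension.
        if (start >= 0 && dot > start + 4) {
            String temp = filename.substring(start + 4, dot);
            // Assumption: keep Unicode letters, digits and underscores; drop everything else.
            xref = temp.replaceAll("[^\\p{L}0-9_]", "");
        }
        return xref;
    }

    public static void main(String[] args) {
        System.out.println(extractXref("document_(see_other_document).doc"));
        System.out.println(extractXref("document_-_see_other_document.doc"));
        System.out.println(extractXref("plain_document.doc"));
    }
}
```

Note that a plain `indexOf("see_")` also fires on names where `see_` is the tail of a longer word, such as `blacksee_other_document.doc`, so this sketch is only suitable if such names cannot occur or are acceptable matches.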
