2

Each group has one of the forms x/y, x.y or x_y.z. Groups are separated from each other by an underscore, and they can appear in any order.

Example:

ABC/DEF_abc.def_PQR/STU_ghi_jkl.mno

I would like to capture the following:

ABC/DEF
abc.def
PQR/STU
ghi_jkl.mno

I have done this using a fairly verbose string iteration and parsing method (shown below), but am wondering if a simple regex can accomplish this.

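// Requires: import java.util.ArrayList;
private static ArrayList<String> go(String s) {
    ArrayList<String> list = new ArrayList<String>();
    boolean inSlash = false; // a '/' was seen in the current group
    boolean inDot = false;   // a '.' was seen in the current group
    int pos = 0;             // start index of the current group
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        switch (c) {
        case '/':
            inSlash = true;
            break;
        case '.':
            inDot = true;
            break;
        case '_':
            // An underscore ends the group only if the group already contains
            // a '/' or a '.'; otherwise it belongs to an x_y.z group.
            if (inSlash || inDot) {
                list.add(s.substring(pos, i));
                inSlash = false;
                inDot = false;
                pos = i + 1;
            }
            break;
        default:
            break;
        }
    }
    list.add(s.substring(pos)); // the last group has no trailing underscore
    System.out.println(list);
    return list;
}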
dogbane
  • The underscore can be delimiter as well as part of a group?? – Andreas Dolk Dec 08 '10 at 12:49
  • The difficulty seems to be in the last type of group (with the underscore in it). Could you elaborate a little bit on the rules for when an underscore should be part of a group, and when it should be the separator character? Perhaps you could post your current code. – Jordi Dec 08 '10 at 12:50
  • yes, that's the fun part :) Maybe some way to look ahead for a dot and then determine if it is a delim or group? – dogbane Dec 08 '10 at 12:51

4 Answers

2

Try this:

((?:[^_./]+/[^_./]+)|(?:[^_./]+\.[^_./]+)|(?:[^_./]+(?:_[^_./]+)+\.[^_./]+))

I don't know Java syntax, but in Perl:

#!/usr/bin/perl
use 5.10.1;
use strict;
use warnings;

my $str = q!ABC/DEF_abc.def_PQR/STU_ghi_jkl.mno_a_b_c.z_a_b_c_d.z_a_b_c_d_e.z!;
my $re = qr!((?:[^_./]+/[^_./]+)|(?:[^_./]+\.[^_./]+)|(?:[^_./]+(?:_[^_./]+)+\.[^_./]+))!;
while($str=~/$re/g) {
    say $1;
}

will produce:

ABC/DEF
abc.def
PQR/STU
ghi_jkl.mno
a_b_c.z
a_b_c_d.z
a_b_c_d_e.z
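
For readers who want this in Java, here is a minimal sketch of the same pattern with `java.util.regex`, assuming well-formed input. The names `GroupMatcher` and `groups` are made up for illustration. The outer capture group is dropped because `Matcher.group()` already returns the whole match, and the backslashes are doubled for the Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupMatcher {
    // Same three alternatives as the Perl pattern above.
    private static final Pattern GROUP = Pattern.compile(
        "[^_./]+/[^_./]+"                      // x/y
        + "|[^_./]+\\.[^_./]+"                 // x.y
        + "|[^_./]+(?:_[^_./]+)+\\.[^_./]+");  // x_y.z, x_y_z.w, ...

    // Collects every match of GROUP in s, left to right.
    static List<String> groups(String s) {
        List<String> result = new ArrayList<>();
        Matcher m = GROUP.matcher(s);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [ABC/DEF, abc.def, PQR/STU, ghi_jkl.mno]
        System.out.println(groups("ABC/DEF_abc.def_PQR/STU_ghi_jkl.mno"));
    }
}
```

The separating underscores are never matched explicitly; `find()` simply skips over them because no alternative can start at an underscore.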
Toto
  • Great, this works! Is it possible to change the last part so that it can match the forms a_b_c.z, a_b_c_d.z, a_b_c_d_e.z etc? – dogbane Dec 08 '10 at 13:28
  • Java regexes are *in appearance* very like Perl 5.0 regexes from around 1993, but support a couple newer features like possessive matching. They don’t support any of the modern constructs. The apparent similarity’s a case of false cognates—of *faux amis* so to speak. **They don’t work on Unicode correctly** without [this fix](http://training.perl.com/scripts/tchrist-unicode-charclasses__alpha.java), and their `\b` and `\B` are so terribly broken that strings like `"élève"` won’t in Java match the pattern `/\b\w+\b/` (`"\\b\\w+\\b"`) **anywhere at all**. At least, not without my fix. – tchrist Dec 08 '10 at 15:13
0

There might be a problem with the underscore since it's not always a separator.

Maybe: ((?<=_)\w+_)?\w+[./]\w+

TomaszK
  • Please be exceedingly cautious using `\w` in Java regexes: it’s [almost always wrong](http://stackoverflow.com/questions/4304928/unicode-equivalents-for-w-and-b-in-java-regular-expressions/4307261#4307261). ☹ – tchrist Dec 08 '10 at 15:07
  • I was just following the javadoc to `java.util.regex.Pattern`. :) – TomaszK Dec 08 '10 at 21:39
  • That is part of the problem, unfortunately. – tchrist Dec 09 '10 at 15:57
0

This regex would probably do (tested with .NET regular expressions):

[a-zA-Z]+[./][a-zA-Z]+|[a-zA-Z]+_[a-zA-Z]+\.[a-zA-Z]+

(If you know your input is well formed, there is no need to match the separator explicitly.)
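
If the groups can contain letters outside ASCII, the same alternation can be written with the Unicode letter property `\p{L}` instead of `[a-zA-Z]`. This is only a sketch of that substitution in Java; the class and method names (`UnicodeGroups`, `groups`) are hypothetical, and the test string uses `\u` escapes so it does not depend on the source file encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeGroups {
    // [a-zA-Z] replaced by \p{L} (any Unicode letter); otherwise the same two alternatives.
    private static final Pattern GROUP = Pattern.compile(
        "\\p{L}+[./]\\p{L}+"              // x/y or x.y
        + "|\\p{L}+_\\p{L}+\\.\\p{L}+");  // x_y.z

    // Collects every match of GROUP in s, left to right.
    static List<String> groups(String s) {
        List<String> result = new ArrayList<>();
        Matcher m = GROUP.matcher(s);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // "\u00e9l\u00e8ve/\u00e9t\u00e9" is "élève/été"
        System.out.println(groups("\u00e9l\u00e8ve/\u00e9t\u00e9_abc.def_ghi_jkl.mno"));
    }
}
```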

heijp06
  • Please do not use `[a-zA-Z]` as a crippled synonym for `\pL`. :( – tchrist Dec 08 '10 at 15:17
  • @tchrist: You are right (of course) and I am lazy (which is a virtue of a programmer, I seem to recall reading that somewhere...) – heijp06 Dec 08 '10 at 16:22
  • there is good-laziness and there is bad-laziness. Bad-laziness avoids work now by promising to work (a lot) more in the future. Good-laziness avoids work in the future by a bit of extra work now. :) – tchrist Dec 08 '10 at 17:03
  • @tchrist: Of course I agree. Rest assured that I have learned a lot from your comments. Not only with regard to character classes but more importantly with regard to the quality of my work in general. i.e. be lazy where I can and diligent where I should. Thank you for your time. – heijp06 Dec 08 '10 at 19:27
0

This one uses a positive lookahead instead of alternation:

[A-Za-z]+(_(?=[A-Za-z]+\.[A-Za-z]+))?[A-Za-z]+[/.][A-Za-z]+
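
A rough Java sketch of how this pattern could be applied, with the backslashes doubled for the string literal. The names `LookaheadGroups` and `groups` are invented for the example. The lookahead lets an underscore join the group only when it is directly followed by `letters.letters`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadGroups {
    // The optional "_" is consumed only if "letters.letters" follows it.
    private static final Pattern GROUP = Pattern.compile(
        "[A-Za-z]+(_(?=[A-Za-z]+\\.[A-Za-z]+))?[A-Za-z]+[/.][A-Za-z]+");

    // Collects every match of GROUP in s, left to right.
    static List<String> groups(String s) {
        List<String> result = new ArrayList<>();
        Matcher m = GROUP.matcher(s);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [ABC/DEF, abc.def, PQR/STU, ghi_jkl.mno]
        System.out.println(groups("ABC/DEF_abc.def_PQR/STU_ghi_jkl.mno"));
    }
}
```

One caveat with the pattern as written: without an underscore, the part before the `/` or `.` has to be at least two letters long, because it is matched by two consecutive `[A-Za-z]+` runs. A group such as `a/b` is therefore not matched.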
mcveat
  • Please do not use `[A-Z]` or `[a-z]` in a regex when `\pL` is what you actually mean—which it usually really is. – tchrist Dec 08 '10 at 15:07