3

I've got a Java program that goes through a file line by line, and tries to match each line against one of four regular expressions. Depending on which expression matched, the program performs a certain action. Here's what I have:

private void processFile(ArrayList<String> lines) {
    ArrayList<Component> Components = new ArrayList<>();
    Pattern pattern = Pattern.compile(
            "Object name\\.{7}: (.++)|"
            + "\\{CAT=([^\\}]++)\\}|"
            + "\\{CODE=([^\\}]++)\\}|"
            + "\\{DESC=([^\\}]++)\\}");

    Matcher matcher;
    // Go through each line and see if the line matches the any of the regexes
    // defined
    Component currentComponent = null;

    for (String line : lines) {
        matcher = pattern.matcher(line);

        if (matcher.find()) {
            // We found a tag. Find out which one
            String match = matcher.group();

            if (match.startsWith("Obj")) {
                // We've got the object name
                if (currentComponent != null) {
                    Components.add(currentComponent);
                }
                currentComponent = new Component();
                currentComponent.setName(matcher.group(1));
            } else if (currentComponent != null) {
                if (match.startsWith("{CAT")) {
                    currentComponent.setCategory(matcher.group(2));
                } else if (match.startsWith("{CODE")) {
                    currentComponent.setOrderCode(matcher.group(3));
                } else if (match.startsWith("{DESC")) {
                    currentComponent.setDescription(matcher.group(4));
                }
            }
        }
    }

    if (currentComponent != null) {
        Components.add(currentComponent);
    }
}

As you can see, I've combined the four regular expressions into one and apply the whole regex on the line. If a match is found, I check the start of the string to determine which expression was matched and then extract the data from the group. In case anyone is interested in running the code, some sample data is presented below:

Object name.......: PMF3800SN
Last modified.....: Wednesday 9 November 2011 11:55:04 AM
File offset (hex).: 00140598 (Hex).
Checksum (hex)....: C1C0 (Hex).
Size (bytes)......: 1,736
Properties........: {*DEVICE}
                    {PREFIX=Q}
                    {*PROPDEFS}
                    {PACKAGE="PCB Package",PACKAGE,1,SOT-323 MOSFET}
                    {*INDEX}
                    {CAT=Transistors}
                    {SUBCAT=MOSFET}
                    {MFR=NXP}
                    {DESC=N-channel TrenchMOS standard level FET with ESD protection}
                    {CODE=1894711}
                    {*COMPONENT}

                    {PACKAGE=SOT-323 MOSFET}
                    *PINOUT SOT-323 MOSFET
                    {ELEMENTS=1}
                    {PIN "D" = D}
                    {PIN "G" = G}
                    {PIN "S" = S}

Although my code works, I don't like the fact that I repeat part of the string later on when calling the startsWith routines.

I'm curious to see how others would have written this.

Amr

Amr Bekhit
  • 4,613
  • 8
  • 34
  • 56
  • 1
    Is there a reason why you didn't create four distinct regex instances and run them one after the other on each line until one matches? – Daniel Hilgarth Apr 12 '12 at 14:42
  • @DanielHilgarth It is not going to be a feasible solution I think. – Phani Apr 12 '12 at 14:43
  • This might solve your problem. http://stackoverflow.com/questions/895279/regular-expression-to-match-multiple-query-string-parameter-value-pairs – Phani Apr 12 '12 at 14:45
  • Not feasible? Do you have a reason for thinking that? (Do you have a reason for thinking that anything you try to do differently will work in _any other way_ under the hood?) – Donal Fellows Apr 12 '12 at 14:46
  • I completely agree with your point,if it's not we then JVM will does automatically. But the point is can we tweak the code rather than writing the same code again and again. Like axtavt replied. – Phani Apr 12 '12 at 14:48
  • @DanielHilgarth: I did consider doing that, but didn't like the fact that I had to recreate the matcher for each expression based on the input string. This would work fine in my original question, where the file is processed line by line. However, if I was processing the entire file through a single string, I wouldn't be able to simply create the matchers by calling Pattern.matcher(fileString) as it would start matching from the beginning of the file. I'd have to manually split the string from the last match and pass that. – Amr Bekhit Apr 12 '12 at 14:56
  • Edit: I've since looked harder at the documentation, and found that you can change the start position of a matcher, so I guess using four distinct regex's wouldn't be as bad as I initially thought. – Amr Bekhit Apr 12 '12 at 14:56
  • 1
    You can also assign a new Pattern to an existing Matcher without changing the current match position ([ref](http://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#usePattern%28java.util.regex.Pattern%29)). It's an extremely useful feature in certain circumstances, but I don't think this is one of them. It would make the code considerably more complicated without adding enough value to justify it. – Alan Moore Apr 12 '12 at 17:56

3 Answers3

3

group() returns null for groups that failed to match. So, you can put your subexpressions into groups and check them for nulls after match:

Pattern pattern = Pattern.compile(
         "(Object name\\.{7}: (.++))|"
         + "(\\{CAT=([^\\}]++)\\})|"
         + "(\\{CODE=([^\\}]++)\\})|"
         + "(\\{DESC=([^\\}]++)\\})"); 
...
if (match.group(1) != null) { // Object ...
    ...
} ...

Actually, you can even do it with your existing groups if you don't have |s inside your subexpressions.

axtavt
  • 239,438
  • 41
  • 511
  • 482
2

As @axtavt pointed out, you can discover directly whether a group participated in the match or not. You don't even have to change the regex; you already have one capturing group for each alternative. I like to use the start(n) method for the tests because it seems neater, but checking group(n) for a null value (as @axtavt did) yields the same result. Here's an example:

private static void processFile(ArrayList<String> lines) {

    Pattern p = Pattern.compile(
            "Object name\\.{7}: (.++)|"
            + "\\{CAT=([^\\}]++)\\}|"
            + "\\{CODE=([^\\}]++)\\}|"
            + "\\{DESC=([^\\}]++)\\}");

    // Create the Matcher now and reassign it to each line as we go.
    Matcher m = p.matcher("");

    for (String line : lines) {
        if (m.reset(line).find()) {
            // If group #n participated in the match, start(n) will be non-negative.
            if (m.start(1) != -1) {
                System.out.printf("%ncreating new component...%n");
                System.out.printf("  name: %s%n", m.group(1));
            } else if (m.start(2) != -1) {
                System.out.printf("  category: %s%n", m.group(2));
            } else if (m.start(3) != -1) {
                System.out.printf("  order code: %s%n", m.group(3));
            } else if (m.start(4) != -1) {
                System.out.printf("  description: %s%n", m.group(4));
            }
        }
    }
}

However, I'm not sure I agree with your reasoning about repeating part of the string in the code. If the data format changes, or you change which fields you extract, it seems like it would be easier to get out of sync when you update the code. In other words, your current code isn't redundant, it's self-documenting. :D

EDIT: You mentioned in a comment the possibility of processing the whole file at once instead of line by line. That's actually the easier way:

private static void processFile(String contents) {

    Pattern p = Pattern.compile(
            "Object name\\.{7}: (.++)|"
            + "\\{CAT=([^\\}]++)\\}|"
            + "\\{CODE=([^\\}]++)\\}|"
            + "\\{DESC=([^\\}]++)\\}");

    Matcher m = p.matcher(contents);

    while (m.find()) {
        if (m.start(1) != -1) {
            System.out.printf("%ncreating new component...%n");
            System.out.printf("  name: %s%n", m.group(1));
        } else if (m.start(2) != -1) {
            System.out.printf("  category: %s%n", m.group(2));
        } else if (m.start(3) != -1) {
            System.out.printf("  order code: %s%n", m.group(3));
        } else if (m.start(4) != -1) {
            System.out.printf("  description: %s%n", m.group(4));
        }
    }
}
Alan Moore
  • 73,866
  • 12
  • 100
  • 156
0

i'd define a meta object that is a pattern + a runnable. loop over the lines, then loop over the meta objects. if one matches, exec the runnable. something like,

class Meta {
  Pattern pattern;
  Runnable runnable;
  Matcher matcher;

  Meta(Pattern p, Runnable r) {
    pattern = p;
    runnable = r;
  }
}

Meta[] metas = new Meta[] { new Meta(Pattern.compile(...), new Runnable() { ... }), new Meta(...), ... };


for (String line : lines) {
  for (Meta meta : metas) {
    final Matcher matcher = meta.pattern.matcher(line);
    if (matcher.matches()) {
      meta.matcher = matcher;
      meta.runnable.run();
    }
  }
}

here's what the Meta object for the "Object" lines would look like,

Meta m = new Meta(Pattern.compile("Object name\\.{7}: (.++)", new Runnable() {
  // We've got the object name
  if (currentComponent != null) {
    Components.add(currentComponent);
  }
  currentComponent = new Component();
  currentComponent.setName(matcher.group(1));
});
Jeffrey Blattman
  • 22,176
  • 9
  • 79
  • 134