Reversing a regex based parser

Question

I inherited a bank interface parser. The previous developer actually did this in a pretty slick way. Each record in the file that comes in from the bank is made up of fixed-length fields. This is how he parses a header record from the download:

    public static final String HEADER_RECORD_REGEX = "^(\\d{3})(\\d{12})(.{20})(\\d\\d)(\\d\\d)(\\d\\d)(\\d{12})(\\d\\d)$";

    private static final int BANK_ID      = 1;
    private static final int ACCOUNT_ID   = 2;
    private static final int COMPANY_NAME = 3;
    private static final int MONTH        = 4;
    private static final int DAY          = 5;
    private static final int YEAR         = 6;
    private static final int SEQUENCE     = 7;
    private static final int TYPE_CODE    = 8;
    private static final int GROUP_COUNT  = TYPE_CODE;

    if ( GROUP_COUNT == matcher.groupCount() )  {
        setBankId( matcher.group( BANK_ID ) );
        setAccountId( matcher.group( ACCOUNT_ID ) );
        setCompanyName( matcher.group( COMPANY_NAME ) );
        setProcessDate( matcher.group( MONTH ), matcher.group( DAY ),
                        matcher.group( YEAR ) );
        setSeqNumber( matcher.group( SEQUENCE ) );
        setTypeCode( matcher.group( TYPE_CODE ) );
    }

I have a new requirement to reverse this process and generate mock bank files so we can test. Is there a way I can reuse this same regex approach to generate the file, or do I just go back to building a standard parser?

thanks

What do you mean by reverse? Create fixed length data files from a mock result file? — , Nov 08 '12 at 21:02
http://stackoverflow.com/questions/748253/how-to-generate-random-strings-that-match-a-given-regexp — durron597, Nov 08 '12 at 21:13
Well, reversing this process is not parsing, it's formatting. So, `new Formatter().format("%3.3s%12.12s%20.20s%2.2s%2.2s%2.2s%12.12s%2.2s", bankID, acctID, companyName, month, day, year, seq, typeCode);` or something similar. Also, in retrospect, it would have been much more useful for our clever parser writer to programmatically define the lengths of each field instead of hardcoding them into his regex. — Wug, Nov 08 '12 at 21:24
I hope they didn't fire the guy who wrote this originally, because of all the horrible ways you can solve a problem like this, his is totally not bad at all. — Wug, Nov 08 '12 at 21:39
That bit with the `GROUP_COUNT` doesn't make sense. The value returned by `matcher.groupCount()` is a static property of the Pattern object associated with the Matcher. It will always be the same, even if the match attempt fails. — Alan Moore, Nov 08 '12 at 22:18
Wug, I agree. The rest of the system, especially their database layer, is absolutely horrid, but this small piece was pretty solid. I have been building parsers for years, and this was a new approach for me. It's definitely going into my war chest. — scphantm, Nov 09 '12 at 12:40
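
Alan Moore's point about `groupCount()` can be checked with a small sketch. It reuses the regex from the question; the class name `GroupCountDemo` and the helper `bankIdOrNull` are made up for illustration. The idea is that `matcher.matches()` is the test that tells you whether the record parsed, while `groupCount()` only reports how many groups the pattern declares.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupCountDemo {
    // Same pattern as HEADER_RECORD_REGEX in the question.
    static final Pattern HEADER = Pattern.compile(
        "^(\\d{3})(\\d{12})(.{20})(\\d\\d)(\\d\\d)(\\d\\d)(\\d{12})(\\d\\d)$");

    // Hypothetical helper: returns the bank ID, or null if the record does not match.
    static String bankIdOrNull(String record) {
        Matcher matcher = HEADER.matcher(record);
        // matches() decides success; groupCount() would be 8 either way.
        if (!matcher.matches()) {
            return null;
        }
        return matcher.group(1);
    }

    public static void main(String[] args) {
        Matcher bad = HEADER.matcher("not a header record");
        System.out.println(bad.groupCount()); // 8, a property of the pattern
        System.out.println(bad.matches());    // false
        System.out.println(bankIdOrNull("not a header record")); // null
    }
}
```

Calling `matcher.group(...)` without a prior successful `matches()` or `find()` throws `IllegalStateException`, so the check is needed for correctness as well as clarity.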

score 1 · Accepted Answer · answered Nov 08 '12 at 21:32

This basically does what you ask for. You can play with it until it suits your needs.

import java.util.*;

class Main
{
    public static String getLine(String bankID, String acctID, String companyName, String month, String day, String year, String seq, String typeCode)
    {
        return new Formatter()
               .format("%3.3s%12.12s%20.20s%2.2s%2.2s%2.2s%12.12s%2.2s", 
                       bankID, acctID, companyName, month,
                       day, year, seq, typeCode)
               .toString(); // 1 semicolon, technically a 1 liner.  aww yeah
    }

    public static void main(String[] args)
    {
        String tester = "123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
        System.out.println(getLine(tester, tester, tester, tester,
                                   tester, tester, tester, tester));
    }
}

The output of that example is:

123123456789ABC123456789ABCDEFGHIJK121212123456789ABC12

Here's the ideone.
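
One thing to watch: `%3.3s` pads with spaces, so a short numeric value would not match the `\d` groups of the original regex. Below is a possible variant, a sketch only, that zero-pads the numeric fields, left-justifies the company name, and validates the result against the parser's own pattern. The class name `HeaderRecordWriter`, the method `headerLine`, and the choice of `int`/`long` parameter types are assumptions, not part of the original code.

```java
import java.util.Locale;
import java.util.regex.Pattern;

class HeaderRecordWriter {
    // Same pattern as HEADER_RECORD_REGEX in the question.
    static final Pattern HEADER = Pattern.compile(
        "^(\\d{3})(\\d{12})(.{20})(\\d\\d)(\\d\\d)(\\d\\d)(\\d{12})(\\d\\d)$");

    // Hypothetical generator for one mock header line.
    // The year is expected as two digits, as in the bank format.
    static String headerLine(int bankId, long accountId, String companyName,
                             int month, int day, int year, long sequence, int typeCode) {
        // %0Nd zero-pads but never truncates; %-20.20s left-justifies and truncates.
        String line = String.format(Locale.ROOT,
            "%03d%012d%-20.20s%02d%02d%02d%012d%02d",
            bankId, accountId, companyName, month, day, year, sequence, typeCode);
        // Reject anything the parser would not accept (too wide, negative, ...).
        if (!HEADER.matcher(line).matches()) {
            throw new IllegalArgumentException("Generated line is not a valid header: " + line);
        }
        return line;
    }

    public static void main(String[] args) {
        System.out.println(headerLine(123, 42L, "ACME CORP", 11, 8, 12, 7L, 1));
    }
}
```

Validating with the same regex means the mock generator and the parser cannot silently drift apart.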

Also lol. Ideone reports that this had a runspace of about 250MB. God, java is awful sometimes. — Wug, Nov 08 '12 at 21:36

score 0 · Answer 2 · answered Nov 08 '12 at 21:17

If by reversing you mean outputting an object to a file, then a parser is not what you need. All you need to do is implement a method that writes the same data members to a file in the same fixed-width layout. You can use String.format with the field lengths you have in the regex. With some refactoring, you can extract the commonalities between the regex and the format string, although you might consider this overkill, as this regex is fairly simple.
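
As a rough illustration of that refactoring, the field widths could live in one table from which both the regex and the format string are built. This is only a sketch: the class `HeaderLayout` and its members are hypothetical names, and the widths are copied from the regex in the question.

```java
class HeaderLayout {
    // Field widths in record order: bank ID, account ID, company name,
    // month, day, year, sequence, type code.
    static final int[] WIDTHS = {3, 12, 20, 2, 2, 2, 12, 2};
    // true where the original regex uses \d, false where it uses '.'.
    static final boolean[] NUMERIC = {true, true, false, true, true, true, true, true};

    // Builds a regex equivalent to the one in the question (\d{2} instead of \d\d).
    static String regex() {
        StringBuilder sb = new StringBuilder("^");
        for (int i = 0; i < WIDTHS.length; i++) {
            sb.append('(').append(NUMERIC[i] ? "\\d" : ".")
              .append('{').append(WIDTHS[i]).append("})");
        }
        return sb.append('$').toString();
    }

    // Builds a String.format pattern that pads and truncates each field to its width.
    static String formatString() {
        StringBuilder sb = new StringBuilder();
        for (int w : WIDTHS) {
            sb.append('%').append(w).append('.').append(w).append('s');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(regex());
        System.out.println(formatString());
    }
}
```

A change to a field length then only has to be made in `WIDTHS`.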

score 0 · Answer 3 · answered Nov 08 '12 at 23:38

You need to step away from letting the regex control you. If you define your structure in another way (I use an enum below) from which you can derive both your regex and a formatter, then not only will the code become much more extensible, but you will also be able to build a marshaller and an unmarshaller from it.

Something like this may be a good start:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BankRecords {
  static enum AccountField {
    BANK_ID("\\d", 3) {
      @Override
      void fill ( Account a, String s ) {
        a.bankId = s;
      }
    },
    ACCOUNT_ID("\\d", 12) {
      @Override
      void fill ( Account a, String s ) {
        a.accountID = s;
      }
    },
    COMPANY_NAME(".", 20) {
      @Override
      void fill ( Account a, String s ) {
        a.companyName = s;
      }
    },
    MONTH("\\d", 2) {
      @Override
      void fill ( Account a, String s ) {
        a.month = s;
      }
    },
    DAY("\\d", 2) {
      @Override
      void fill ( Account a, String s ) {
        a.day = s;
      }
    },
    YEAR("\\d", 2) {
      @Override
      void fill ( Account a, String s ) {
        a.year = s;
      }
    },
    SEQUENCE("\\d", 12) {
      @Override
      void fill ( Account a, String s ) {
        a.seqNumber = s;
      }
    },
    TYPE_CODE("\\d", 2) {
      @Override
      void fill ( Account a, String s ) {
        a.typeCode = s;
      }
    };
    // The type string in the regex.
    final String type;
    // How many characters.
    final int count;

    AccountField(String type, int count) {
      this.type = type;
      this.count = count;
    }

    // Each field can fill its part in the Account.
    abstract void fill ( Account a, String s );

    // My pattern.
    static Pattern pattern = Pattern.compile(asRegex());

    public static Account parse ( String record ) {
      // Fire off the matcher with the regex; group() is only valid after a successful match.
      Matcher matcher = pattern.matcher(record);
      if ( !matcher.matches() ) {
        throw new IllegalArgumentException("Not a valid record: " + record);
      }
      // Put each captured field in the Account object.
      Account account = new Account ();
      for ( AccountField f : AccountField.values() ) {
        f.fill(account, matcher.group(f.ordinal() + 1));
      }
      return account;
    }

    public static String format ( Account account ) {
      StringBuilder s = new StringBuilder ();
      // Roll each field of the account into the string using the correct length from the enum.
      return s.toString();
    }

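    // No explicit "= null": static initializers run in textual order, so it would
    // discard the value cached while "pattern" above was being initialized.
    private static String regex;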

    static String asRegex() {
      // Only do this once.
      if (regex == null) {
        // Grow my regex from the field definitions.
        StringBuilder r = new StringBuilder("^");
        for (AccountField f : AccountField.values()) {
          r.append("(").append(f.type);
          // Special case count = 1 or 2.
          switch (f.count) {
            case 1:
              break;
            case 2:
              // Just one more.
              r.append(f.type);
              break;
            default:
              // More than that should use the {} notation.
              r.append("{").append(f.count).append("}");
              break;
          }
          r.append(")");
        }
        // End of record.
        r.append("$");
        regex = r.toString();
      }
      return regex;
    }
  }

  public static class Account {
    String bankId;
    String accountID;
    String companyName;
    String month;
    String day;
    String year;
    String seqNumber;
    String typeCode;
  }
}

Note how each enum constant encapsulates the essence of its field: the type, the number of characters, and where it goes in the Account object.
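
The `format` method above is left as a stub. One way it could be filled in, assuming Java 8 or later, is to give each enum constant a getter as well as a setter, so the same definition drives both directions. The following is a compact sketch along those lines, not the answer's own code; `HeaderCodec`, `Header`, and `Field` are names invented here.

```java
import java.util.function.BiConsumer;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeaderCodec {
    static class Header {
        String bankId, accountId, companyName, month, day, year, seqNumber, typeCode;
    }

    // Each constant knows its regex type, its width, and how to read and write its value.
    enum Field {
        BANK_ID("\\d", 3, h -> h.bankId, (h, s) -> h.bankId = s),
        ACCOUNT_ID("\\d", 12, h -> h.accountId, (h, s) -> h.accountId = s),
        COMPANY_NAME(".", 20, h -> h.companyName, (h, s) -> h.companyName = s),
        MONTH("\\d", 2, h -> h.month, (h, s) -> h.month = s),
        DAY("\\d", 2, h -> h.day, (h, s) -> h.day = s),
        YEAR("\\d", 2, h -> h.year, (h, s) -> h.year = s),
        SEQUENCE("\\d", 12, h -> h.seqNumber, (h, s) -> h.seqNumber = s),
        TYPE_CODE("\\d", 2, h -> h.typeCode, (h, s) -> h.typeCode = s);

        final String type;
        final int width;
        final Function<Header, String> getter;
        final BiConsumer<Header, String> setter;

        Field(String type, int width,
              Function<Header, String> getter, BiConsumer<Header, String> setter) {
            this.type = type;
            this.width = width;
            this.getter = getter;
            this.setter = setter;
        }
    }

    static final Pattern PATTERN = Pattern.compile(buildRegex());

    static String buildRegex() {
        StringBuilder r = new StringBuilder("^");
        for (Field f : Field.values()) {
            r.append('(').append(f.type).append('{').append(f.width).append("})");
        }
        return r.append('$').toString();
    }

    // Unmarshal: fixed-width line to Header.
    static Header parse(String record) {
        Matcher m = PATTERN.matcher(record);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a header record: " + record);
        }
        Header h = new Header();
        for (Field f : Field.values()) {
            f.setter.accept(h, m.group(f.ordinal() + 1));
        }
        return h;
    }

    // Marshal: Header to fixed-width line. Assumes every field has been set.
    static String format(Header h) {
        StringBuilder sb = new StringBuilder();
        for (Field f : Field.values()) {
            // Pad and truncate each value to the width declared on the enum constant.
            sb.append(String.format("%" + f.width + "." + f.width + "s", f.getter.apply(h)));
        }
        return sb.toString();
    }
}
```

With this shape, `format(parse(line))` returns the original line for any record the regex accepts, which gives a cheap round-trip test for the mock files.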

I have done this in the past. It gets out of hand pretty quick. I wasn't very happy with it. — scphantm, Nov 09 '12 at 12:54
You have chosen a solution that uses a completely different mechanism to format the data than to parse it. If your data format ever changes you will need to change your code in two different places. I'd call THAT "out of hand". Anywhoo ... your decision. :) — OldCurmudgeon, Nov 09 '12 at 13:33
