6

I have an incoming records filter stored with the logical clause as given below.

Acct1 = 'Y' AND Acct2 = 'N' AND Acct3 = 'N' AND Acct4 = 'N' AND Acct5 = 'N' AND ((Acct6 = 'N' OR Acct7 = 'N' AND Acct1 = 'Y') AND Formatted= 'N' AND Acct9 = 'N' AND (Acct10 = 'N' AND Acct11 = 'N') AND EditableField= 'N' )

My data input to this clause will be from Csv file as below.

Country,Type,Usage,Acct1,Acct2,Acct3,Acct4,Acct5,Acct6,Acct7,Formatted,Acct9,Acct10,Acct11,EditableField
USA,Premium,Corporate,Y,N,Y,N,N,N,Y,N,Y,N,Y,N,
Mexico,Premium,Corporate,Y,N,Y,N,Y,N,Y,N,Y,N,Y,N,
USA,Premium,Corporate,Y,N,Y,N,N,N,N,Y,Y,N,Y,N,
USA,Premium,Corporate,Y,N,Y,N,Y,N,Y,Y,Y,N,Y,N,

I will have to filter out the records in the file based on the conditions defined in the clause. This is a example of one simple clause but there will be more inner conditions than this and the clause can be changed whenever the user want and there will be 10 such clauses the records has to pass through sequentially.

So I am looking for a way to dynamically interpret the clause and apply it on the incoming records. Please provide me your suggestions about how to design/ any example if available.

Atom
  • 768
  • 1
  • 15
  • 35
  • 2
    You need a recursive descent expression parser, or the Dijkstra Shunting-yard algorithm. – user207421 Aug 24 '15 at 23:26
  • 2
    @EJP Consider me as a starter and kindly provide me an example that can help me understand. – Atom Aug 25 '15 at 00:00
  • You'll be pleased to know that boolean expressions are pretty easy to parse and evaluate. Here's how to write a recursive descent parser (and even how to build ASTs!): http://stackoverflow.com/questions/2245962/is-there-an-alternative-for-flex-bison-that-is-usable-on-8-bit-embedded-systems/2336769#2336769 – Ira Baxter Aug 26 '15 at 00:12
  • @minion Google can provide that as well as I can. I consider it too broad for an answer. – user207421 Aug 26 '15 at 01:31
  • 1
    Thanks.. I understand – Atom Aug 27 '15 at 21:33
  • Are your parentheses intentionally meaningless in that example query? – Eric Aug 27 '15 at 21:50
  • Are the column names **always** of the form `Acct12`? Then this could be solved ""pragmatically"" with a `StringTokenizer` and a few `String#substring` and `Integer#parseInt` calls.... – Marco13 Aug 28 '15 at 10:20
  • @Eric No. Sorry about that.. It can be mix of AND OR conditions and hence parantheses are important.. I have edited the example – Atom Aug 28 '15 at 22:10
  • @Marco13 edited my question – Atom Aug 28 '15 at 22:14
  • I know you already have a good answer, but I wanted to finish a second idea I was working on, so I still have a question: is it possible that the query contains the same field twice, and if so, can its value be Y in one condition and N in another, as in `(Acct1=Y AND Acct2=N) OR (Acct1=N AND Acct2=Y)` ? – m69's been on strike for years Aug 29 '15 at 00:19
  • @m69 Yes.. it will contain the same field twice. – Atom Aug 31 '15 at 17:57
  • The same field can be validated against different conditions as well. – Atom Aug 31 '15 at 18:56

4 Answers4

5

Here's the complete solution which does not include third-party libraries like ANTLR or JavaCC. Note that while it's extensible, its capabilities are still limited. If you want to create much more complex expressions, you'd better use grammar generator.

First, let's write a tokenizer which splits the input string to the tokens. Here's the token types:

private static enum TokenType {
    WHITESPACE, AND, OR, EQUALS, LEFT_PAREN, RIGHT_PAREN, IDENTIFIER, LITERAL, EOF
}

The token class itself:

private static class Token {
    final TokenType type;
    final int start; // start position in input (for error reporting)
    final String data; // payload

    public Token(TokenType type, int start, String data) {
        this.type = type;
        this.start = start;
        this.data = data;
    }

    @Override
    public String toString() {
        return type + "[" + data + "]";
    }
}

To simplify the tokenization let's create a regexp which reads the next token from the input string:

private static final Pattern TOKENS = 
        Pattern.compile("(\\s+)|(AND)|(OR)|(=)|(\\()|(\\))|(\\w+)|\'([^\']+)\'");

Note that it has many groups, one group per TokenType in the same order (first comes WHITESPACE, then AND and so on). Finally the tokenizer method:

private static TokenStream tokenize(String input) throws ParseException {
    Matcher matcher = TOKENS.matcher(input);
    List<Token> tokens = new ArrayList<>();
    int offset = 0;
    TokenType[] types = TokenType.values();
    while (offset != input.length()) {
        if (!matcher.find() || matcher.start() != offset) {
            throw new ParseException("Unexpected token at " + offset, offset);
        }
        for (int i = 0; i < types.length; i++) {
            if (matcher.group(i + 1) != null) {
                if (types[i] != TokenType.WHITESPACE)
                    tokens.add(new Token(types[i], offset, matcher.group(i + 1)));
                break;
            }
        }
        offset = matcher.end();
    }
    tokens.add(new Token(TokenType.EOF, input.length(), ""));
    return new TokenStream(tokens);
}

I'm using java.text.ParseException. Here we apply the regex Matcher till the end of the input. If it doesn't match at the current position, we throw an exception. Otherwise we look for found matching group and create a token from it ignoring the WHITESPACE tokens. Finally we add a EOF token which indicates the end of the input. The result is returned as special TokenStream object. Here's the TokenStream class which will help us to do the parsing:

private static class TokenStream {
    final List<Token> tokens;
    int offset = 0;

    public TokenStream(List<Token> tokens) {
        this.tokens = tokens;
    }

    // consume next token of given type (throw exception if type differs)
    public Token consume(TokenType type) throws ParseException {
        Token token = tokens.get(offset++);
        if (token.type != type) {
            throw new ParseException("Unexpected token at " + token.start
                    + ": " + token + " (was looking for " + type + ")",
                    token.start);
        }
        return token;
    }

    // consume token of given type (return null and don't advance if type differs)
    public Token consumeIf(TokenType type) {
        Token token = tokens.get(offset);
        if (token.type == type) {
            offset++;
            return token;
        }
        return null;
    }

    @Override
    public String toString() {
        return tokens.toString();
    }
}

So we have a tokenizer, hoorah. You can test it right now using System.out.println(tokenize("Acct1 = 'Y' AND (Acct2 = 'N' OR Acct3 = 'N')"));

Now let's write the parser which will create the tree-like representation of our expression. First the interface Expr for all the tree nodes:

public interface Expr {
    public boolean evaluate(Map<String, String> data);
}

Its only method used to evaluate the expression for given data set and return true if data set matches.

The most basic expression is the EqualsExpr which is like Acct1 = 'Y' or 'Y' = Acct1:

private static class EqualsExpr implements Expr {
    private final String identifier, literal;

    public EqualsExpr(TokenStream stream) throws ParseException {
        Token token = stream.consumeIf(TokenType.IDENTIFIER);
        if(token != null) {
            this.identifier = token.data;
            stream.consume(TokenType.EQUALS);
            this.literal = stream.consume(TokenType.LITERAL).data;
        } else {
            this.literal = stream.consume(TokenType.LITERAL).data;
            stream.consume(TokenType.EQUALS);
            this.identifier = stream.consume(TokenType.IDENTIFIER).data;
        }
    }

    @Override
    public String toString() {
        return identifier+"='"+literal+"'";
    }

    @Override
    public boolean evaluate(Map<String, String> data) {
        return literal.equals(data.get(identifier));
    }
}

The toString() method is just for information, you can remove it.

Next we will define the SubExpr class which is either EqualsExpr or something more complex in parentheses (if we see the parenthesis):

private static class SubExpr implements Expr {
    private final Expr child;

    public SubExpr(TokenStream stream) throws ParseException {
        if(stream.consumeIf(TokenType.LEFT_PAREN) != null) {
            child = new OrExpr(stream);
            stream.consume(TokenType.RIGHT_PAREN);
        } else {
            child = new EqualsExpr(stream);
        }
    }

    @Override
    public String toString() {
        return "("+child+")";
    }

    @Override
    public boolean evaluate(Map<String, String> data) {
        return child.evaluate(data);
    }
}

Next is AndExpr which is a set of SubExpr expressions joined by AND operator:

private static class AndExpr implements Expr {
    private final List<Expr> children = new ArrayList<>();

    public AndExpr(TokenStream stream) throws ParseException {
        do {
            children.add(new SubExpr(stream));
        } while(stream.consumeIf(TokenType.AND) != null);
    }

    @Override
    public String toString() {
        return children.stream().map(Object::toString).collect(Collectors.joining(" AND "));
    }

    @Override
    public boolean evaluate(Map<String, String> data) {
        for(Expr child : children) {
            if(!child.evaluate(data))
                return false;
        }
        return true;
    }
}

I use Java-8 Stream API in the toString for brevity. If you cannot use Java-8, you may rewrite it with the for loop or remove toString completely.

Finally we define OrExpr which is a set of AndExpr joined by OR (usually OR has lower priority than AND). It's very similar to AndExpr:

private static class OrExpr implements Expr {
    private final List<Expr> children = new ArrayList<>();

    public OrExpr(TokenStream stream) throws ParseException {
        do {
            children.add(new AndExpr(stream));
        } while(stream.consumeIf(TokenType.OR) != null);
    }

    @Override
    public String toString() {
        return children.stream().map(Object::toString).collect(Collectors.joining(" OR "));
    }

    @Override
    public boolean evaluate(Map<String, String> data) {
        for(Expr child : children) {
            if(child.evaluate(data))
                return true;
        }
        return false;
    }
}

And the final parse method:

public static Expr parse(TokenStream stream) throws ParseException {
    OrExpr expr = new OrExpr(stream);
    stream.consume(TokenType.EOF); // ensure that we parsed the whole input
    return expr;
}

So you can parse your expressions to get the Expr objects, then evaluate them against the rows of your CSV file. I assume that you're capable to parse the CSV row into the Map<String, String>. Here's usage example:

Map<String, String> data = new HashMap<>();
data.put("Acct1", "Y");
data.put("Acct2", "N");
data.put("Acct3", "Y");
data.put("Acct4", "N");

Expr expr = parse(tokenize("Acct1 = 'Y' AND (Acct2 = 'Y' OR Acct3 = 'Y')"));
System.out.println(expr.evaluate(data)); // true
expr = parse(tokenize("Acct1 = 'N' OR 'Y' = Acct2 AND Acct3 = 'Y'"));
System.out.println(expr.evaluate(data)); // false
Tagir Valeev
  • 97,161
  • 19
  • 222
  • 334
  • Really appreciate it. This logic is fine, need to test the processing time. – Atom Aug 28 '15 at 22:49
  • Can you help me understand "grammar generator" in java context because my expressions are complex than what is asked for. When searching for grammar generators it shows ANTLR tools :( – Atom Aug 28 '15 at 23:06
  • @minion, yes, ANTLR and JavaCC are grammar generators. You first write a grammar file using the special language like `jj` or `g4` and ANTLR/JavaCC is capable to autogenerate the Java code using it. The generated Java code uses the same principles like in my hand-crafted example (there are also tokens, token types, token stream, tokenizer and parser). While it's not very hard to extend my example to support more operators (like `NOT`, not equals, etc.), for complex grammar it's easier to write `jj` or `g4` code than the Java code. – Tagir Valeev Aug 29 '15 at 02:41
  • @TagirValeev really appreciate the logic. is there a way we can extend this logic for < , > also. I was trying to add GreaterThanExpr, but was not getting any success. – sinsuren May 04 '20 at 13:37
4

I don't know how efficient this will be in Java, but basic string-replace operations could be a simple solution for this.

You start with a query string:

Acct1 = 'Y' AND Acct2 = 'N' AND Acct3 = 'Y' AND Acct4 = 'N' AND Acct5 = 'N' OR ((Acct6 = 'N' OR Acct7 = 'N') AND Acct8 = 'N' AND Acct9 = 'Y' AND (Acct10 = 'N' OR Acct11 = 'N') AND Acct12 = 'N')

For each line in the csv, e.g. Y,N,Y,N,Y,N,Y,N,Y,N,Y,N string-replace the column headers in the query by the values; that gives you:

Y = 'Y' AND N = 'N' AND Y = 'Y' AND N = 'N' AND Y = 'N' OR ((N = 'N' OR Y = 'N') AND N = 'N' AND Y = 'Y' AND (N = 'N' OR Y = 'N') AND N = 'N')

Then replace the comparisons by their boolean value:
- replace N = 'N' and Y = 'Y' by Y
- replace N = 'Y' and Y = 'N' by N

This will result in:

Y AND Y AND Y AND Y AND N OR ((Y OR N) AND Y AND Y AND (Y OR N) AND Y)

Then loop through a number of string-replace operations which replace truthy values by Y and falsey values by N:
- replace Y AND Y by Y
- replace N AND N, N AND Y and Y AND N, by N
- replace Y OR Y, N OR Y and Y OR N, by Y
- replace N OR N by N
- replace (N) by N
- replace (Y) by Y

This will gradually reduce the boolean statement:

Y AND Y AND Y AND Y AND N OR ((Y OR N) AND Y AND Y AND (Y OR N) AND Y)
Y AND Y AND N OR ((Y) AND Y AND (Y) AND Y)
Y AND N OR ( Y AND Y AND Y AND Y)
N OR ( Y AND Y )
N OR ( Y )
Y

If the queries include implicit precedences without brackets, like N AND N OR Y AND Y where you want AND to have precedence over OR, always exhaust the possibilities to replace AND and brackets before replacing OR:

while (string length decreases) {
    while (string length decreases) {
        replace every "(Z)" by "Z"
        replace every "X AND Y" by "Z"
    }
    replace one "X OR Y" by "Z"
}

During this reduction, make sure to check whether the string length has decreased after every iteration, to avoid infinite loops caused by malformed queries.

  • 1
    The logic is very good, but then question is, will the string replace operations scale for billions of such iterations that I will have to do? I will definitely try this out. Thanks a lot for this logic. – Atom Aug 28 '15 at 19:17
  • 1
    @minion I don't speak Java, so I have no idea how efficient it is for string operations. The easiest way to write the replace operations would be with regex's, but performance-wise that would probably be a disaster. Checking whether the string length has changed with each iteration is also a bit of a time-eater. If you can make some assumptions about the queries, like whether the queries can be malformed, and especially if you know there won't be implicit precendence, then you can probably simplify things a lot. – m69's been on strike for years Aug 28 '15 at 19:38
  • Thanks a lot. Your inputs will be helpful for my design. – Atom Aug 28 '15 at 19:47
  • 1
    Don't worry, I knew from the start that Tagir Valeev would run off with the bounty :-) – m69's been on strike for years Aug 28 '15 at 22:50
1

Hint:

A possible solution is to store your Boolean condition values in a single string attribute, like "YNYNNNYNYNYN", or , better, packed as a binary integer. Then, for a given clause, generate a table of all accepted strings. A join operation will return all desired records.

You can even process several clauses in a single go by adjoining the clause number to the accepted strings when generating the table.

Even though the table size can be exponential in the number of conditions, this can remain quite manageable for a moderate number of conditions.

0

What you have is an expression written in some language that seems compliant with the grammar of the WHERE clause of SQL. So you need:

  • a parser for this language that can build an AST and then a tree of expressions
  • an engine to evaluate the expressions tree against your context (ie the environment against the names Acct1, Acct2 etc are resolved against)

This is a simple language so you can build your parser by hand, or otherwise look at ANTLR or JavaCC - and in this case I suggest you take a look at some sample (ANTLR or JavaCC) - of course, you don't need a full SQL parser! Just extract the bits you need.

An easier approach is to write the filter expression in some language that can be invoked via the Java scripting interface, like Javascript or Groovy (or Ruby, Python...). I don't suggest running a find/replace on the input text to transform the SQL-like language to the target language (for example Python has and and or operators - lowercase) as that would break easily depending on the content of the input string.

Raffaele
  • 20,627
  • 6
  • 47
  • 86
  • 1
    Thanks a lot for the post but I do not have much understanding of what you have posted. My tech stack is Java here. I need to write java logic for these clauses to filter the incoming record. I shall try to understand the things you suggested as well if I can. – Atom Aug 25 '15 at 00:07
  • The Java program you need to write is called a parser, and can be written either by hand or generated by a tool with its own language (JavaCC or ANTLR). The parser will build a tree of expressions bound to some environment (the CSV record) and then you need a program (the engine) that can evaluate the tree in a given environment – Raffaele Aug 25 '15 at 00:11
  • Okay.. Do we have any standard expressions or functions defined that can evaluate such expressions in java because writing a parser and testing it with different scenarios looks like dark magic to me!! Any example on how to proceed in writing code in java is also helpful. – Atom Aug 25 '15 at 22:56
  • 1
    @minon: _"Do we have any standard expressions or functions defined that can evaluate such expressions"_ - No, java does not come with a built in way to evaluate arbitrary languages which you invent, because there is no `java.util.MindReader` – Eric Aug 27 '15 at 21:51
  • And well-formed parenthesized expressions are beyond regular expressions, anyway. – greybeard Aug 28 '15 at 05:59
  • 1
    @Eric java.util.MindReader. I wish if it is real. Kindly ignore my ignorance. – Atom Aug 28 '15 at 19:52