2

In the following code I need to get table names as tokens using SQL reserved words as separators. The tokens should contain the table name or the table name followed by a suffix.

For example, given

table1 t1 inner join table2 t2 outer join table3

The code should return three tokens:

Token 1: table1 t1
Token 2: table2 t2 
Token 3: table3

This code instead uses the first reserved word as token, without discarding any other following reserved words:

    String str = "table1 t1 inner join table2 t2 outer join table3";
    String [] tokens = sql2.split("\\son\\s|\\sinner\\s|\\souter\\s|\\sjoin\\s");
    
    for (int i = 0; i<tokens.length; i++)
         System.out.println("Token "+(i+1)+":"+tokens[i]);

This returns:

Token 1:table1 t1
Token 2:join table2 t2 
Token 3:join table3

What is the problem and how to make this work?

ps0604
  • 1,227
  • 23
  • 133
  • 330
  • This is probably going to be more complex than it seems. Regular expressions are not going to be powerful enough for the job. What should happen if you see `...inner join table1 on table1.value = table2.value and table1.value = 'inner value'`. At any rate, you're not going to want to split, because you don't want every word that isn't a reserved word, you just want the word that follows it. This might require a library, but I'm afraid I don't have on to recommend to you. – mypetlion Jul 31 '20 at 21:02
  • You are right, anything that is not a table, including things such as table1.column1 should be omitted – ps0604 Jul 31 '20 at 21:06
  • https://stackoverflow.com/questions/16616034/jdbc-get-table-names-involved-in-a-sql-query – rootkonda Jul 31 '20 at 21:06
  • Using the JDBC result set didn’t work for me – ps0604 Jul 31 '20 at 21:07
  • 1
    It is a lot harder than you think. I used ANTLR4 a few years back to analyse static SQL from a DB2 catalog, and you need a proper SQL grammar to generate a parser. After which, it isn't too bad... – nullTerminator Jul 31 '20 at 21:17
  • To further expand mypetition's comment "as" can optionally be used when aliasing a table, and inner/outer can also use left, right, or full ... `select a.col, b.col, a.col2 from table1 as a left inner join table2 as b on table1.val = table2.val` — regex can be used as _part_ of a parser, but regex is not sufficient _**as**_ a parser. – Stephen P Jul 31 '20 at 22:15

4 Answers4

2

This works for the general case for a series of joined tables:

String[] tableNames = str.split(" (?=inner|outer|left|join|right|cross|full).*?join ");
Bohemian
  • 412,405
  • 93
  • 575
  • 722
1

You can use (?:\w+\s+){1,2}(?=inner|outer join)|(?<=inner join)(?:\s+\w+){1,2}|(?<=outer join)(?:\s+\w+){1,2} as the regex.

Demo:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        // Test strings
        String[] arr = { "table1 t1 inner join table2 t2 outer join table3",
                "table1 t1 inner join table2 t2 outer join table3 t3 inner join table2" };
        Pattern pattern = Pattern.compile(
                "(?:\\w+\\s+){1,2}(?=inner|outer join)|(?<=inner join)(?:\\s+\\w+){1,2}|(?<=outer join)(?:\\s+\\w+){1,2}");
        for (String s : arr) {
            System.out.println("Processing: " + s);
            Matcher matcher = pattern.matcher(s);
            while (matcher.find()) {
                System.out.println(matcher.group().trim());
            }
        }
    }
}

Output:

Processing: table1 t1 inner join table2 t2 outer join table3
table1 t1
table2 t2
table3
Processing: table1 t1 inner join table2 t2 outer join table3 t3 inner join table2
table1 t1
table2 t2
table3 t3
table2
Arvind Kumar Avinash
  • 71,965
  • 6
  • 74
  • 110
  • Fails in this case: https://regex101.com/r/YfJah6/1 – mypetlion Jul 31 '20 at 22:53
  • @mypetlion - You've made a mistake in your string. As per the requirement, `join table3` should be `join table3 t3`. I've extended the regex from my original answer to cover your string (once you correct it) as well. – Arvind Kumar Avinash Jul 31 '20 at 23:48
0

I've tested your pattern (unescaped, it's \son\s|\sinner\s|\souter\s|\sjoin\s) against this test string: table1 t1 inner join table2 t2 outer join table3 on regex101.com and the only match I got is for inner and outer. So since you're splitting the string by these tokens, you get your result.

Perhaps this can help you for your specific case. I have went for a regex approach, instead of splitting the data.

public class PatternChecker {

    public static void main(String[] args) {
        String str = "table1 t1 inner join table2 t2 outer join table3";
        Pattern p = Pattern.compile("(table[0-9]+( [a-zA-Z0-9]+ )?)");
        Matcher m = p.matcher(str);

        while(m.find()) {
            System.out.println(m.group(0));
        }
    }

}

Later edit

The split pattern \\son\\s|\\sinner\\s|\\souter\\s|\\sjoin\\s did not work because of the mandatory whitespaces used. For instance, you are searching for *on* or *inner* or *outer* or *join* (whitespaces are marked with an asterisk). The whitespaces are part of the keywords you're splitting with. *join* could not be matched since its left-side whitespace was already picked up by the *outer* and *inner* right-side whitespace matches.

Going back to the split solution, one fix would be to mark the left-side whitespace of join as optional via the ? quantifier; this would be the new pattern: \\son\\s|\\sinner\\s|\\souter\\s|\\s?join\\s. This yields some empty tokens that can be filtered out

Another idea would be to consider aggregations using join (i.e. inner join, outer join) as full search criteria, which would lead to \\son\\s|\\sinner join\\s|\\souter join\\s. No empty tokens are generated.

public class PatternChecker {

    public static void main(String[] args) {
        String str = "employee t1 inner join department t2 outer join job join table4 history on a=b";

        String[] tokens = str.split("\\son\\s|\\sinner join\\s|\\souter join\\s|\\sjoin\\s");

        for(String token : tokens) {
            System.out.println(token);
        }

        // Output
        // employee t1
        // department t2
        // job
        // table4 history
        // a=b

    }

}

Note that, since you're also including on, you can filter out all the matched tokens containing the equals symbol.

For a generic fix, you would need to isolate the string contained between from and where and apply the idea above.


x sylver
  • 109
  • 5
  • 3
    Wouldn't this only work if the table name starts with the word table? I think OP is trying to determine what words are table identifiers based on their context in the SQL statement. – Badr B Jul 31 '20 at 21:43
  • @Badr Of course, that is why I have specified that this works for the specific case OP was talking about. I will try to think of a solution covering any table name and update my answer. – x sylver Jul 31 '20 at 21:50
  • 1
    Fails on this case: https://regex101.com/r/Sr6GzI/1 – mypetlion Jul 31 '20 at 22:55
  • Indeed, missed that possibility. In the meantime, I gave up on using regex and extended OP's idea. – x sylver Jul 31 '20 at 23:14
  • 2
    @xsylver - You've done a commendable job. Do not get demotivated by the d o w n v o t e s. Keep it up! – Arvind Kumar Avinash Jul 31 '20 at 23:18
  • 1
    @Arvind Thank you for the kind words :) – x sylver Jul 31 '20 at 23:36
0

I found this question that is essentially the same as yours.

You can mostly solve this problem by going through a regex nightmare, or you can try to use an external library like Zql which actually parses the SQL statement for you.

Badr B
  • 998
  • 1
  • 9
  • 17