
I am working on a Python flow script that parses SQL statements and extracts the target table and the source tables. Example:

CREATE TABLE TAR_TAB1 AS
SELECT * FROM SRC_TAB1 JOIN SRC_TAB2

CREATE TABLE TAR_TAB2 AS 
SELECT * FROM SRC_TAB3 JOIN SRC_TAB4

INSERT INTO TAR_TAB3 AS
SELECT * FROM SRC_TAB5 JOIN SRC_TAB6

I wrote these regexes:

target = re.findall(r'\w+(?=\s+AS)', data)
source = re.findall(r'(?:FROM|JOIN)\s*([^\s]+)', data)

The results are as expected:

the target list contains TAR_TAB1, TAR_TAB2, TAR_TAB3
the source list contains SRC_TAB1, SRC_TAB2, SRC_TAB3, SRC_TAB4, etc.

How can I get the output as either a list or, preferably, a DataFrame (so that the flow of extraction is not disturbed), like this:

target         source
TAR_TAB1       SRC_TAB1
TAR_TAB1       SRC_TAB2
TAR_TAB2       SRC_TAB3
TAR_TAB2       SRC_TAB4
TAR_TAB3       SRC_TAB5
TAR_TAB3       SRC_TAB6
Sai Astro

3 Answers


Here's a solution:

import re
import numpy as np
import pandas as pd

# `text` holds the SQL statements from the question
targets = re.findall(r'(?:CREATE\s+TABLE|INSERT\s+INTO)\s+([a-z0-9A-Z_]+)\s+AS', text)
sources = re.findall(r'SELECT\s+\*\s+FROM\s+([a-z0-9A-Z_]+)\s+JOIN\s+([a-z0-9A-Z_]+)', text)

# each target has multiple sources, so repeat each target once per source
lens = [len(src) for src in sources]
targets = np.repeat(targets, lens)

# 'flatten' the list of sources from [(t1, t2), (t3, t4)] to
# [t1, t2, t3, t4]
sources = [tab for exp in sources for tab in exp]

pd.DataFrame({"src": sources, "tgt": targets})

Results:

        src       tgt
0  SRC_TAB1  TAR_TAB1
1  SRC_TAB2  TAR_TAB1
2  SRC_TAB3  TAR_TAB2
3  SRC_TAB4  TAR_TAB2
4  SRC_TAB5  TAR_TAB3
5  SRC_TAB6  TAR_TAB3
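
If you'd rather avoid numpy, the same pairing can be built with zip and a list comprehension; a minimal sketch, assuming text again holds the SQL statements from the question:

import re
import pandas as pd

targets = re.findall(r'(?:CREATE\s+TABLE|INSERT\s+INTO)\s+(\w+)\s+AS', text)
sources = re.findall(r'FROM\s+(\w+)\s+JOIN\s+(\w+)', text)

# pair each target with both tables from its (FROM, JOIN) tuple
rows = [(tgt, src) for tgt, pair in zip(targets, sources) for src in pair]
pd.DataFrame(rows, columns=["tgt", "src"])
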
Roy2012

This approach uses pyparsing to actually parse the SQL statements (within the subset syntax you've shown in your examples):

import pyparsing as pp
ppc = pp.pyparsing_common


ident = ppc.identifier

CREATE, INSERT, INTO, TABLE, AS, SELECT, FROM, JOIN = \
    map(pp.CaselessKeyword, "CREATE INSERT INTO TABLE AS SELECT FROM JOIN".split())

select_stmt = (SELECT
               + (pp.delimitedList(ident) | '*')("columns")
               + FROM
               + ((ident + JOIN.suppress() + ident) | ident)("tables"))

src_target_stmt = ((CREATE + TABLE | INSERT + INTO)("action")
                   + ident("target")
                   + AS
                   + pp.Group(select_stmt)("source"))


tests = """
    CREATE TABLE TAR_TAB1 AS SELECT * FROM SRC_TAB1 JOIN SRC_TAB2
    CREATE TABLE TAR_TAB2 AS SELECT * FROM SRC_TAB3 JOIN SRC_TAB4
    INSERT INTO TAR_TAB3 AS SELECT COL1,COL2 FROM SRC_TAB5 JOIN SRC_TAB6
    """

# useful for debugging
#src_target_stmt.runTests(tests)

# dump parsed values out as CSV output
for t in tests.splitlines():
    if not t.strip():
        continue
    result = src_target_stmt.parseString(t)
    target = result.target
    action = result.action[0]
    for src in result.source.tables:
        print("{},{},{}".format(action, target, src))

Prints:

CREATE,TAR_TAB1,SRC_TAB1
CREATE,TAR_TAB1,SRC_TAB2
CREATE,TAR_TAB2,SRC_TAB3
CREATE,TAR_TAB2,SRC_TAB4
INSERT,TAR_TAB3,SRC_TAB5
INSERT,TAR_TAB3,SRC_TAB6

As your project continues and you discover new requirements (such as needing to extract the SQL action, as shown in this example, or finding other variations of SQL constructs that your parsing needs to handle), extending a parser will be easier and more maintainable than extending a regex.
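
Since the question asks for a DataFrame, the parsed values can also be collected into one rather than printed; a minimal sketch reusing the loop above (the pandas import is an addition):

import pandas as pd

rows = []
for t in tests.splitlines():
    if not t.strip():
        continue
    result = src_target_stmt.parseString(t)
    # one row per source table, repeating the target
    for src in result.source.tables:
        rows.append({"action": result.action[0],
                     "target": result.target,
                     "source": src})

df = pd.DataFrame(rows)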

PaulMcG

The idea is to combine the three patterns with alternation. From the re documentation: A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B.

import re
import pandas as pd

Q1 = 'CREATE TABLE TAR_TAB1 AS SELECT * FROM SRC_TAB1 JOIN SRC_TAB2'
Q2 = 'CREATE TABLE TAR_TAB2 AS SELECT * FROM SRC_TAB3 JOIN SRC_TAB4'
Q3 = 'CREATE TABLE TAR_TAB3 AS SELECT * FROM SRC_TAB5 JOIN SRC_TAB6'

requests = [Q1, Q2, Q3]

target_filter = r'\w+(?=\s+AS)'
source_filter1 = r'(?<=FROM )\S+'
source_filter2 = r'(?<=JOIN )\S+'

regex_filter = target_filter + '|' + source_filter1 + '|' + source_filter2

results = [re.findall(regex_filter, Q) for Q in requests]

print(results)

# [['TAR_TAB1', 'SRC_TAB1', 'SRC_TAB2'], ['TAR_TAB2', 'SRC_TAB3', 'SRC_TAB4'], ['TAR_TAB3', 'SRC_TAB5', 'SRC_TAB6']]

To convert the results list into a DataFrame:

df = pd.DataFrame(results, columns =['target', 'source1', 'source2'])

     target   source1   source2
0  TAR_TAB1  SRC_TAB1  SRC_TAB2
1  TAR_TAB2  SRC_TAB3  SRC_TAB4
2  TAR_TAB3  SRC_TAB5  SRC_TAB6

Or if you want to present it differently:

l = []
for res in results:
    l.append([res[0], res[1]])
    l.append([res[0], res[2]])

df = pd.DataFrame(l, columns =['target', 'source'])

and then:

     target    source
0  TAR_TAB1  SRC_TAB1
1  TAR_TAB1  SRC_TAB2
2  TAR_TAB2  SRC_TAB3
3  TAR_TAB2  SRC_TAB4
4  TAR_TAB3  SRC_TAB5
5  TAR_TAB3  SRC_TAB6
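
The same reshaping from the wide frame (target, source1, source2) to the long target/source form can also be done with pandas' melt instead of the explicit loop; a minimal sketch, assuming the wide DataFrame built from the first pd.DataFrame(results, ...) call above:

long_df = (df.melt(id_vars='target', value_name='source')
             .drop(columns='variable')   # drop the 'source1'/'source2' label column
             .sort_values('target')
             .reset_index(drop=True))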
Laurent B.