I am working on a python flow script. Which basically parses SQL statements and get the target table and the source tables. Example
CREATE TABLE TAR_TAB1 AS
SELECT * FROM SRC_TAB1 JOIN SRC_TAB2
CREATE TABLE TAR_TAB2 AS
SELECT * FROM SRC_TAB3 JOIN SRC_TAB4
INSERT INTO TAR_TAB3 AS
SELECT * FROM SRC_TAB5 JOIN SRC_TAB6
I wrote regex
target = re.findall(r'\w+(?=\s+AS)', data)
source = re.findall(r'(?:FROM|JOIN)\s*([^\s]+), data)
the results are as expected,
target list contains, TAR_TAB1, TAR_TAB2, TAR_TAB3
source list contains, SRC_TAB1, SRC_TAB2, SRC_TAB3, SRC_TAB4, ETC
HOW TO GET OUTPUT LIKE EITHER LIST OR DATA FRAME LIKE, preferably dataframe, as such the flow of extraction is not disturbed.
target source
TAR_TAB1 SRC_TAB1
TAR_TAB1 SRC_TAB2
TAR_TAB2 SRC_TAB3
TAR_TAB2 SRC_TAB4
TAR_TAB3 SRC_TAB5
TAR_TAB4 SRC_TAB6