0

I need an opinion about parsing SQL (Oracle) with regex.

There are pretty entertaining posts in StackOverflow that explain that regex over HTML or XML is hopeless (example).

Is that the case with SQL as well? And even if the answer is no, is the specific problem below solvable with regex? Last fallback, what else would work?

Requirement: Verify that the table names in arbitrary SQL are enclosed in square brackets (should be "[name]").

Constraint: Use ksh/bash, i.e. grep|sed|awk.

For the SQL below the script should flag bad_table as not matching (should have been [bad_table]). Should work on arbitrary SQL, here a simpler one without joins.

SELECT id
FROM [order]
WHERE id IN (SELECT oid
             FROM [audit] a, bad_table d, (SELECT xid FROM [t1]) r
             WHERE a.data = r.data and a.data = d.data)
Adrian Rosoga
  • 452
  • 4
  • 9
  • 1
    Which [DBMS](https://en.wikipedia.org/wiki/DBMS) product are you using? "SQL" is just a query language, not the name of a specific database product. Please add a [tag](https://stackoverflow.com/help/tagging) for the database product you are using `postgresql`, `oracle`, `sql-server`, `db2`, ... –  Sep 14 '18 at 13:49
  • 1
    For the specific problem of extracting table names from SQL with no comments, the problem is probably solvable. – Gordon Linoff Sep 14 '18 at 13:54
  • 1
    "*Is that the case with SQL as well?*" Yes, because SQL allows arbitrarily nested constructs (e.g. with `(` `)`). – melpomene Sep 14 '18 at 13:59
  • 1
    How would you want to handle common table expressions? E.g., `WITH cte AS ( SELECT ... ) SELECT ... FROM cte`? In this case `cte` is not an object in your database, but I think you'll be hard-pressed to keep your script from thinking it is a table. I won't say your requirements are impossible, but they're definitely impractical. – Matthew McPeak Sep 14 '18 at 15:34
  • 1
    This is tagged with Oracle, square brackets is a SQL Server (and possibly other databases) thing and is not allowed in Oracle. – Brian Leach Sep 14 '18 at 16:12
  • Another consideration is unaliased table names, e.g. instead of say `select a.id from [audit] a` you might have `select audit.id from [audit]`, so should the plain `audit` table reference in the select list (or anywhere else in the query) be flagged too? – Alex Poole Sep 14 '18 at 16:51
  • 1
    Do I dare ask why you need to do this? Maybe there is a more straightforward way to do what you want (eg, you can parse SQL in Oracle ) – tbone Sep 14 '18 at 18:45
  • @tbone & Brian - I should have mentioned from the beginning why such processing. The SQL files are input to a tool that processes them in a specific way. That tool requires tables in this specific format. At the end of the processing chain the schema is created in Oracle. – Adrian Rosoga Sep 16 '18 at 09:11

1 Answers1

0

I have done more reading and it is clear now that the task is not possible with regex. The theory says that - SQL is non-regular language and regex is not up to this task. A SQL parser is required.

That being said, a compromise has been reached as something had to be done at work: use a collection of regular expressions (2 as of now) that work on the ~300 SQL files that need processing as of today. Admit that one day someone will come up with a SQL that would fail. Add for that occasion a comment in the code that explains why a tweak is needed to cover that corner case and, best, use a SQL parser...

Many thanks to everyone who contributed.

Adrian Rosoga
  • 452
  • 4
  • 9