28

So let's assume we have a simple query like this:

Select a.col1, b.col2 from tb1 as a inner join tb2 as b on tb1.col7 = tb2.col8;

The result should look like this:

tb1 col1
tb1 col7
tb2 col2
tb2 col8

I've tried to solve this problem using some Python libraries:

1) Even extracting only the tables using sqlparse might be a huge problem. For example, this official book doesn't work properly at all.

2) Achieving this with regular expressions seems really hard.

3) But then I found this, which might help. However, the problem is that I can't connect to any database and execute the query.

Any ideas?

Rocketq
  • Googling for "python sql parser" turns up this SO question: http://stackoverflow.com/questions/1394998/parsing-sql-with-python – PaulMcG Feb 25 '16 at 19:04

8 Answers

57

sql-metadata is a Python library that uses a tokenized query returned by python-sqlparse and generates query metadata.

This metadata can return column and table names from your supplied SQL query. Here are a couple of examples from the sql-metadata GitHub readme:

>>> sql_metadata.get_query_columns("SELECT test, id FROM foo, bar")
[u'test', u'id']

>>> sql_metadata.get_query_tables("SELECT test, id FROM foo, bar")
[u'foo', u'bar']

>>> sql_metadata.get_query_limit_and_offset('SELECT foo_limit FROM bar_offset LIMIT 50 OFFSET 1000')
(50, 1000)

A hosted version of the library is available at sql-app.infocruncher.com, so you can check whether it works for you.
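Note that newer releases of sql-metadata (2.x) replaced these module-level functions with a Parser class; the equivalents of the calls above look roughly like this (per the project README):

>>> from sql_metadata import Parser
>>> Parser("SELECT test, id FROM foo, bar").columns
['test', 'id']

>>> Parser("SELECT test, id FROM foo, bar").tables
['foo', 'bar']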

Dylan Hogg
  • The package is not quite perfect currently, but very close. It's being actively worked on, though, and was the best solution I could find. – k13 Aug 21 '20 at 13:09
  • On my side, it unfortunately didn't support our "CREATE VIEW AS SELECT" with nested views / joins ... – tdebroc Mar 12 '21 at 10:29
  • @tdebroc are you able to create an issue for your case in the sql-metadata github repo? – Dylan Hogg Mar 16 '21 at 01:22
  • @DylanHogg, it just took me some time to simplify the query and anonymise the fields, but here it is: https://github.com/macbre/sql-metadata/issues/112 For info, out of 80 of my queries, the results are wrong for 27. I ended up writing my own two algorithms, which I can share if you want, but they look like the one from Jan (which I haven't tested) – tdebroc Mar 19 '21 at 17:37
  • @tdebroc fantastic - it looks like the sql-metadata author is looking into it :) I'll check it out as well if I find time. – Dylan Hogg Mar 22 '21 at 00:29
  • This should be the selected answer, sql_metadata module is brilliant. Thanks for contributing this @DylanHogg – synthesizerpatel Oct 14 '21 at 19:57
  • @synthesizerpatel agreed sql_metadata is very good, glad it works for you. Credit to macbre for the library! It's worth mentioning that anyone can test the library online at https://sql-app.infocruncher.com/ to see if it works for you – Dylan Hogg Oct 14 '21 at 23:55
  • This is the best SQL parser package I've ever seen! – uniquegino Nov 11 '22 at 00:13
  • sql-app.infocruncher.com cannot read my SQL; it keeps saying "Oops, I didn't understand that. Please check you entered valid SQL" although I have input the simplest possible query (no union, one table), only select ... from – Tom Tom Dec 28 '22 at 07:54
  • @TomTom could you open an issue at https://github.com/dylanhogg/sql-app with your query and I'll diagnose. Under the hood it's using the sql-metadata package, so most likely replicates an issue there which would be good to fix. – Dylan Hogg Dec 31 '22 at 09:30
  • This one parses better: https://www.activequerybuilder.com/analyzer.html – Tom Tom Jan 03 '23 at 07:31
6

Really, this is no easy task. You could use a lexer (ply in this example) and define several rules to extract tokens from the string. The following code defines these rules for the different parts of your SQL string and maps aliases back to the original table names, since the input string can contain aliases. As a result, you get a dictionary (result) with the table names as keys.

import re
import ply.lex as lex

tokens = (
    "TABLE",
    "JOIN",
    "COLUMN",
    "TRASH"
)

tables = {"tables": {}, "alias": {}}
columns = []

t_TRASH = r"Select|on|=|;|\s+|,|\t|\r"

def t_TABLE(t):
    r"from\s(\w+)\sas\s(\w+)"

    regex = re.compile(t_TABLE.__doc__)
    m = regex.search(t.value)
    if m is not None:
        tbl = m.group(1)
        alias = m.group(2)
        tables["tables"][tbl] = ""
        tables["alias"][alias] = tbl

    return t

def t_JOIN(t):
    r"inner\s+join\s+(\w+)\s+as\s+(\w+)"

    regex = re.compile(t_JOIN.__doc__)
    m = regex.search(t.value)
    if m is not None:
        tbl = m.group(1)
        alias = m.group(2)
        tables["tables"][tbl] = ""
        tables["alias"][alias] = tbl
    return t

def t_COLUMN(t):
    r"(\w+\.\w+)"

    regex = re.compile(t_COLUMN.__doc__)
    m = regex.search(t.value)
    if m is not None:
        t.value = m.group(1)
        columns.append(t.value)
    return t

def t_error(t):
    raise TypeError("Unknown text '%s'" % (t.value,))

# here is where the magic starts
def mylex(inp):
    lexer = lex.lex()
    lexer.input(inp)

    for token in lexer:
        pass

    result = {}
    for col in columns:
        tbl, c = col.split('.')
        if tbl in tables["alias"].keys():
            key = tables["alias"][tbl]
        else:
            key = tbl

        if key in result:
            result[key].append(c)
        else:
            result[key] = list()
            result[key].append(c)

    print(result)
    # {'tb1': ['col1', 'col7'], 'tb2': ['col2', 'col8']}

string = "Select a.col1, b.col2 from tb1 as a inner join tb2 as b on tb1.col7 = tb2.col8;"
mylex(string)
Jan
  • You're right, this task is really hard. But it seems to be kind of popular, and I believed it had already been solved. – Rocketq Feb 25 '16 at 15:05
  • I get this error: TypeError: is a built-in module (PS. I'm not an expert in Python) – Momog May 28 '17 at 17:09
  • It should be noted that this script needs to be run as a file, and that it only works for very basic SQL strings. – John Drinane Dec 23 '19 at 17:46
5

I was tackling a similar problem and found a simpler solution that seems to work well.

import re

def tables_in_query(sql_str):

    # remove the /* */ comments
    q = re.sub(r"/\*[^*]*\*+(?:[^*/][^*]*\*+)*/", "", sql_str)

    # remove whole line -- and # comments
    lines = [line for line in q.splitlines() if not re.match(r"^\s*(--|#)", line)]

    # remove trailing -- and # comments
    q = " ".join([re.split("--|#", line)[0] for line in lines])

    # split on blanks, parens and semicolons
    tokens = re.split(r"[\s)(;]+", q)

    # scan the tokens. if we see a FROM or JOIN, we set the get_next
    # flag, and grab the next one (unless it's SELECT).

    tables = set()
    get_next = False
    for tok in tokens:
        if get_next:
            if tok.lower() not in ["", "select"]:
                tables.add(tok)
            get_next = False
        get_next = tok.lower() in ["from", "join"]

    dictTables = dict()
    for table in tables:
        fields = []
        for token in tokens:
            if token.startswith(table):
                if token != table:
                    fields.append(token)
        if len(set(fields)) >= 1:
            dictTables[table] = list(set(fields))
    return dictTables

Code adapted from https://grisha.org/blog/2016/11/14/table-names-from-sql/
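For reference, running this against the query from the question gives something along these lines (key order may vary); note that aliased columns such as a.col1 are not resolved back to their tables:

sql = "Select a.col1, b.col2 from tb1 as a inner join tb2 as b on tb1.col7 = tb2.col8;"
print(tables_in_query(sql))
# {'tb1': ['tb1.col7'], 'tb2': ['tb2.col8']}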

Matt Camp
5

moz-sql-parser is a Python library that converts a subset of SQL-92 queries into JSON-izable parse trees. Maybe it's what you want.

Here is an example.

>>> from moz_sql_parser import parse
>>> parse("SELECT id,name FROM dual WHERE id>3 and id<10 ORDER BY name")
{'select': [{'value': 'id'}, {'value': 'name'}], 'from': 'dual', 'where': {'and': [{'gt': ['id', 3]}, {'lt': ['id', 10]}]}, 'orderby': {'value': 'name'}}
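If you only need the names for simple queries shaped like the one above, you can read them straight out of the tree. A minimal sketch, assuming the key layout shown in the example output (joins, aliases and subqueries nest differently and would need extra handling):

from moz_sql_parser import parse

tree = parse("SELECT id,name FROM dual WHERE id>3 and id<10 ORDER BY name")

# 'from' holds a plain string for a single table, or a list for several
tables = tree["from"] if isinstance(tree["from"], list) else [tree["from"]]

# each entry in the select list is a dict with a 'value' key
select = tree["select"] if isinstance(tree["select"], list) else [tree["select"]]
columns = [item["value"] for item in select]

print(tables)   # ['dual']
print(columns)  # ['id', 'name']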
return long
2

Create a list of all the tables present in the DB. You can then search for each table name in the query. This obviously isn't foolproof, and the code will break if any column/alias name matches a table name, but it can work as a workaround.
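A minimal sketch of this workaround; the hard-coded known_tables list is a stand-in for whatever catalog query your database offers (e.g. information_schema.tables):

import re

# stand-in for a catalog query such as:
#   SELECT table_name FROM information_schema.tables
known_tables = ["tb1", "tb2", "users", "orders"]

def match_known_tables(sql, tables):
    # report each known table name that appears as a whole word in the query
    return [t for t in tables
            if re.search(r"\b%s\b" % re.escape(t), sql, re.IGNORECASE)]

sql = "Select a.col1, b.col2 from tb1 as a inner join tb2 as b on tb1.col7 = tb2.col8;"
print(match_known_tables(sql, known_tables))  # ['tb1', 'tb2']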

Aman
2
import pandas as pd

# txt = """<your SQL text here>"""
txt_1 = txt
replace_list = ['\n', '(', ')', '*', '=', '-', ';', '/', '.']
for ch in replace_list:
    txt_1 = txt_1.replace(ch, ' ')
txt_1 = txt_1.split()

# grab the token after FROM/JOIN/TABLE plus the one following it, rejoined
# with a dot (the '.' between schema and table was replaced by a space above)
res = []
for i in range(1, len(txt_1) - 1):
    if txt_1[i - 1].lower() in ['from', 'join', 'table'] and txt_1[i].lower() != 'select':
        res.append(txt_1[i] + "." + txt_1[i + 1])

# keep only sufficiently long matches; change 15 to 0 if you want all the caught strings
f_res_l = [s for s in res if len(s) > 15]

All_Table_List = f_res_l
print("All the unique tables from the SQL text, in the order of their appearance in the code:\n", 100 * '*')
df = pd.DataFrame(All_Table_List, columns=['Tables_Names'])
list_ = list(df["Tables_Names"].unique())
df_1_Final = pd.DataFrame(list_, columns=['Tables_Names'])
df_1_Final.reset_index(level=0, inplace=True)
df_1_Final
  • Only code without any explanation in an answer is useless, especially for a really old question that has an accepted answer and also a bountied one. Explain why your answer is better than the others. – Toto Mar 19 '21 at 12:44
1

Unfortunately, in order to do this successfully for "complex SQL" queries, you will more or less have to implement a complete parser for the particular database engine you are using.

As an example, consider this very basic "complex" query:

WITH a AS (
    SELECT col1 AS c FROM b
)
SELECT c FROM a

In this case, a is not a table but a common table expression (CTE), and should be excluded from your output. There's no simple way of using regexes to realize that b is a table access but a is not - your code will really have to understand the SQL at a deeper level.
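To see the failure mode concretely, here is a minimal sketch of the naive FROM/JOIN token scan (the same idea as in another answer above) applied to this query; it wrongly reports the CTE a as a table:

import re

sql = """
WITH a AS (
    SELECT col1 AS c FROM b
)
SELECT c FROM a
"""

# naive scan: treat every identifier that follows FROM or JOIN as a table
tokens = re.split(r"[\s)(;]+", sql)
naive_tables = {tokens[i + 1] for i, tok in enumerate(tokens[:-1])
                if tok.lower() in ("from", "join")}
print(naive_tables)  # {'a', 'b'} (order varies) - but 'a' is a CTE, not a table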

Also consider

SELECT * FROM tbl

You'd have to know the column names actually present in a particular instance of a database (and accessible to a particular user, too) to answer that correctly.

If by "works with complex SQL" you mean that it must work with any valid SQL statement, you also need to specify for which SQL dialect - or implement dialect-specific solutions. A solution which works with any SQL handled by a database that does not implement CTE:s would not work in one that does.

I am sorry to say so, but I do not think you will find a complete solution which works for arbitrarily complex SQL queries. You'll have to settle for a solution which works with a subset of a particular SQL-dialect.

1

For my simple use case (one table in the query, no joins), I used the following tweak:

lst = "select * from table".split(" ")
lst = [item for item in lst if len(item)>0]
table_name = lst[lst.index("from")+1]
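One caveat: list.index is case-sensitive, so this raises a ValueError when the query spells FROM in uppercase. A small variant that lowercases the tokens just for the lookup:

lst = "SELECT * FROM table".split(" ")
lst = [item for item in lst if len(item) > 0]
table_name = lst[[item.lower() for item in lst].index("from") + 1]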
Idhem