28

So let's assume we have a simple query like this:

Select a.col1, b.col2 from tb1 as a inner join tb2 as b on tb1.col7 = tb2.col8;

The result should look like this:

tb1 col1
tb1 col7
tb2 col2
tb2 col8

I've tried to solve this problem using some Python libraries:

1) Even extracting only the tables using sqlparse might be a huge problem. For example, this official book doesn't work properly at all.

2) Achieving this with regular expressions seems really hard.

3) But then I found this, which might help. However, the problem is that I can't connect to any database and execute the query.

Any ideas?

Rocketq
  • Googling for "python sql parser" turns up this SO question: http://stackoverflow.com/questions/1394998/parsing-sql-with-python – PaulMcG Feb 25 '16 at 19:04

8 Answers

57

sql-metadata is a Python library that uses a tokenized query returned by python-sqlparse and generates query metadata.

This metadata can return column and table names from your supplied SQL query. Here are a couple of examples from the sql-metadata GitHub readme:

>>> sql_metadata.get_query_columns("SELECT test, id FROM foo, bar")
[u'test', u'id']

>>> sql_metadata.get_query_tables("SELECT test, id FROM foo, bar")
[u'foo', u'bar']

>>> sql_metadata.get_query_limit_and_offset('SELECT foo_limit FROM bar_offset LIMIT 50 OFFSET 1000')
(50, 1000)

A hosted version of the library is available at sql-app.infocruncher.com, so you can check whether it works for you.
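Note that newer releases of sql-metadata (2.x) replaced these module-level functions with a Parser class; the equivalents of the calls above look roughly like this (per the project README):

>>> from sql_metadata import Parser
>>> Parser("SELECT test, id FROM foo, bar").columns
['test', 'id']

>>> Parser("SELECT test, id FROM foo, bar").tables
['foo', 'bar']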

Dylan Hogg
  • The package is not quite perfect currently, but very close. It's being actively worked on, though, and was the best solution I could find. – k13 Aug 21 '20 at 13:09
  • On my side, it unfortunately didn't support our "CREATE VIEW AS SELECT" with nested views / joins ... – tdebroc Mar 12 '21 at 10:29
  • @tdebroc are you able to create an issue for your case in the sql-metadata github repo? – Dylan Hogg Mar 16 '21 at 01:22
  • @DylanHogg, it just took me some time to simplify the query and anonymise the fields, but here it is: https://github.com/macbre/sql-metadata/issues/112 For info, out of 80 of my queries, the results are wrong for 27. I ended up writing my own two algorithms, which I can share if you want, but they look like the one from Jan (which I haven't tested) – tdebroc Mar 19 '21 at 17:37
  • @tdebroc fantastic - it looks like the sql-metadata author is looking into it :) I'll check it out as well if I find time. – Dylan Hogg Mar 22 '21 at 00:29
  • This should be the selected answer, sql_metadata module is brilliant. Thanks for contributing this @DylanHogg – synthesizerpatel Oct 14 '21 at 19:57
  • @synthesizerpatel agreed sql_metadata is very good, glad it works for you. Credit to macbre for the library! It's worth mentioning that anyone can test the library online at https://sql-app.infocruncher.com/ to see if it works for you – Dylan Hogg Oct 14 '21 at 23:55
  • This is the best SQL parser package I've ever seen! – uniquegino Nov 11 '22 at 00:13
  • sql-app.infocruncher.com cannot read my SQL; it keeps saying "Oops, I didn't understand that. Please check you entered valid SQL" although I have input the simplest possible query (no union, one table), only select ... from – Tom Tom Dec 28 '22 at 07:54
  • @TomTom could you open an issue at https://github.com/dylanhogg/sql-app with your query and I'll diagnose. Under the hood it's using the sql-metadata package, so most likely replicates an issue there which would be good to fix. – Dylan Hogg Dec 31 '22 at 09:30
  • This one parses better: https://www.activequerybuilder.com/analyzer.html – Tom Tom Jan 03 '23 at 07:31
6

Really, this is no easy task. You could use a lexer (ply in this example) and define several rules to extract tokens from the string. The following code defines these rules for the different parts of your SQL string and maps aliases back to the original table names, since the input string can contain aliases. As a result, you get a dictionary (result) with the table names as keys.

import re
import ply.lex as lex

tokens = (
    "TABLE",
    "JOIN",
    "COLUMN",
    "TRASH"
)

tables = {"tables": {}, "alias": {}}
columns = []

t_TRASH = r"Select|on|=|;|\s+|,|\t|\r"

def t_TABLE(t):
    r"from\s(\w+)\sas\s(\w+)"

    regex = re.compile(t_TABLE.__doc__)
    m = regex.search(t.value)
    if m is not None:
        tbl = m.group(1)
        alias = m.group(2)
        tables["tables"][tbl] = ""
        tables["alias"][alias] = tbl

    return t

def t_JOIN(t):
    r"inner\s+join\s+(\w+)\s+as\s+(\w+)"

    regex = re.compile(t_JOIN.__doc__)
    m = regex.search(t.value)
    if m is not None:
        tbl = m.group(1)
        alias = m.group(2)
        tables["tables"][tbl] = ""
        tables["alias"][alias] = tbl
    return t

def t_COLUMN(t):
    r"(\w+\.\w+)"

    regex = re.compile(t_COLUMN.__doc__)
    m = regex.search(t.value)
    if m is not None:
        t.value = m.group(1)
        columns.append(t.value)
    return t

def t_error(t):
    raise TypeError("Unknown text '%s'" % (t.value,))

# here is where the magic starts
def mylex(inp):
    lexer = lex.lex()
    lexer.input(inp)

    for token in lexer:
        pass

    result = {}
    for col in columns:
        tbl, c = col.split('.')
        if tbl in tables["alias"].keys():
            key = tables["alias"][tbl]
        else:
            key = tbl

        if key in result:
            result[key].append(c)
        else:
            result[key] = list()
            result[key].append(c)

    print(result)
    # {'tb1': ['col1', 'col7'], 'tb2': ['col2', 'col8']}

string = "Select a.col1, b.col2 from tb1 as a inner join tb2 as b on tb1.col7 = tb2.col8;"
mylex(string)
Jan
  • You're right, this task is really hard. But it seems to be kind of popular, and I believed it had already been solved. – Rocketq Feb 25 '16 at 15:05
  • I get this error: TypeError: is a built-in module (PS. I'm not an expert in Python) – Momog May 28 '17 at 17:09
  • It should be noted that this script needs to be run as a file, and that it only works for very basic SQL strings. – John Drinane Dec 23 '19 at 17:46
5

I was tackling a similar problem and found a simpler solution that seems to work well.

import re

def tables_in_query(sql_str):

    # remove the /* */ comments
    q = re.sub(r"/\*[^*]*\*+(?:[^*/][^*]*\*+)*/", "", sql_str)

    # remove whole line -- and # comments
    lines = [line for line in q.splitlines() if not re.match(r"^\s*(--|#)", line)]

    # remove trailing -- and # comments
    q = " ".join([re.split("--|#", line)[0] for line in lines])

    # split on blanks, parens and semicolons
    tokens = re.split(r"[\s)(;]+", q)

    # scan the tokens. if we see a FROM or JOIN, we set the get_next
    # flag, and grab the next one (unless it's SELECT).

    tables = set()
    get_next = False
    for tok in tokens:
        if get_next:
            if tok.lower() not in ["", "select"]:
                tables.add(tok)
            get_next = False
        get_next = tok.lower() in ["from", "join"]

    dictTables = dict()
    for table in tables:
        fields = []
        for token in tokens:
            if token.startswith(table):
                if token != table:
                    fields.append(token)
        if len(set(fields)) >= 1:
            dictTables[table] = list(set(fields))
    return dictTables

Code adapted from https://grisha.org/blog/2016/11/14/table-names-from-sql/
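For reference, running this against the query from the question gives something along these lines (key order may vary); note that aliased columns such as a.col1 are not resolved back to their tables:

sql = "Select a.col1, b.col2 from tb1 as a inner join tb2 as b on tb1.col7 = tb2.col8;"
print(tables_in_query(sql))
# {'tb1': ['tb1.col7'], 'tb2': ['tb2.col8']}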

Matt Camp
5

moz-sql-parser is a Python library that converts a subset of SQL-92 queries into JSON-izable parse trees. Maybe it's what you want.

Here is an example.

>>> from moz_sql_parser import parse
>>> parse("SELECT id,name FROM dual WHERE id>3 and id<10 ORDER BY name")
{'select': [{'value': 'id'}, {'value': 'name'}], 'from': 'dual', 'where': {'and': [{'gt': ['id', 3]}, {'lt': ['id', 10]}]}, 'orderby': {'value': 'name'}}
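If you only need the names for simple queries shaped like the one above, you can read them straight out of the tree. A minimal sketch, assuming the key layout shown in the example output (joins, aliases and subqueries nest differently and would need extra handling):

from moz_sql_parser import parse

tree = parse("SELECT id,name FROM dual WHERE id>3 and id<10 ORDER BY name")

# 'from' holds a plain string for a single table, or a list for several
tables = tree["from"] if isinstance(tree["from"], list) else [tree["from"]]

# each entry in the select list is a dict with a 'value' key
select = tree["select"] if isinstance(tree["select"], list) else [tree["select"]]
columns = [item["value"] for item in select]

print(tables)   # ['dual']
print(columns)  # ['id', 'name']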
return long
2

Create a list of all the tables present in the DB. You can then search for each table name in the query. This obviously isn't foolproof, and the code will break if any column/alias name matches a table name, but it can work as a workaround.
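A minimal sketch of this workaround; the hard-coded known_tables list is a stand-in for whatever catalog query your database offers (e.g. information_schema.tables):

import re

# stand-in for a catalog query such as:
#   SELECT table_name FROM information_schema.tables
known_tables = ["tb1", "tb2", "users", "orders"]

def match_known_tables(sql, tables):
    # report each known table name that appears as a whole word in the query
    return [t for t in tables
            if re.search(r"\b%s\b" % re.escape(t), sql, re.IGNORECASE)]

sql = "Select a.col1, b.col2 from tb1 as a inner join tb2 as b on tb1.col7 = tb2.col8;"
print(match_known_tables(sql, known_tables))  # ['tb1', 'tb2']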

Aman
2
import pandas as pd

# txt = """<your SQL text here>"""
txt_1 = txt
replace_list = ['\n', '(', ')', '*', '=', '-', ';', '/', '.']
for ch in replace_list:
    txt_1 = txt_1.replace(ch, ' ')
txt_1 = txt_1.split()

# grab the token after FROM/JOIN/TABLE plus the one following it, rejoined
# with a dot (the '.' between schema and table was replaced by a space above)
res = []
for i in range(1, len(txt_1) - 1):
    if txt_1[i - 1].lower() in ['from', 'join', 'table'] and txt_1[i].lower() != 'select':
        res.append(txt_1[i] + "." + txt_1[i + 1])

# keep only sufficiently long matches; change 15 to 0 if you want all the caught strings
f_res_l = [s for s in res if len(s) > 15]

All_Table_List = f_res_l
print("All the unique tables from the SQL text, in the order of their appearance in the code:\n", 100 * '*')
df = pd.DataFrame(All_Table_List, columns=['Tables_Names'])
list_ = list(df["Tables_Names"].unique())
df_1_Final = pd.DataFrame(list_, columns=['Tables_Names'])
df_1_Final.reset_index(level=0, inplace=True)
df_1_Final
  • Only code without any explanation in an answer is useless, especially for a really old question that has an accepted answer and also a bountied one. Explain why your answer is better than the others. – Toto Mar 19 '21 at 12:44
1

Unfortunately, in order to do this successfully for "complex SQL" queries, you will more or less have to implement a complete parser for the particular database engine you are using.

As an example, consider this very basic "complex" query:

WITH a AS (
    SELECT col1 AS c FROM b
)
SELECT c FROM a

In this case, a is not a table but a common table expression (CTE), and should be excluded from your output. There's no simple way of using regexes to realize that b is a table access but a is not - your code will really have to understand the SQL at a deeper level.
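To see the failure mode concretely, here is a minimal sketch of the naive FROM/JOIN token scan (the same idea as in another answer above) applied to this query; it wrongly reports the CTE a as a table:

import re

sql = """
WITH a AS (
    SELECT col1 AS c FROM b
)
SELECT c FROM a
"""

# naive scan: treat every identifier that follows FROM or JOIN as a table
tokens = re.split(r"[\s)(;]+", sql)
naive_tables = {tokens[i + 1] for i, tok in enumerate(tokens[:-1])
                if tok.lower() in ("from", "join")}
print(naive_tables)  # {'a', 'b'} (order varies) - but 'a' is a CTE, not a table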

Also consider

SELECT * FROM tbl

You'd have to know the column names actually present in a particular instance of a database (and accessible to a particular user, too) to answer that correctly.

If by "works with complex SQL" you mean that it must work with any valid SQL statement, you also need to specify for which SQL dialect - or implement dialect-specific solutions. A solution which works with any SQL handled by a database that does not implement CTE:s would not work in one that does.

I am sorry to say so, but I do not think you will find a complete solution which works for arbitrarily complex SQL queries. You'll have to settle for a solution which works with a subset of a particular SQL-dialect.

1

For my simple use case (one table in the query, no joins), I used the following tweak:

lst = "select * from table".split(" ")
lst = [item for item in lst if len(item)>0]
table_name = lst[lst.index("from")+1]
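One caveat: list.index is case-sensitive, so this raises a ValueError when the query spells FROM in uppercase. A small variant that lowercases the tokens just for the lookup:

lst = "SELECT * FROM table".split(" ")
lst = [item for item in lst if len(item) > 0]
table_name = lst[[item.lower() for item in lst].index("from") + 1]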
Idhem