
Goal: Access a database of hundreds of tables, check each table's columns for a specified column name, and store the names of the matching tables in a list.

Overview: I set up access to a database and stored the names of all the tables in a variable.

import pandas
import pyodbc

main_path = 'my_database_name'

with pyodbc.connect("DRIVER={SQL Server};"
                    "SERVER=MyServerName;"
                    f"DATABASE={main_path};"
                    "Trusted_Connection=yes;") as main_conn:

    # The body must be indented inside the with block. Note that pyodbc's
    # connection context manager commits on exit rather than closing, so
    # main_conn remains usable below.
    db_table_names = [row.table_name for row in main_conn.cursor().tables()]

I then tried to iterate through the list of table names, retrieve each table, and check whether the column name I'm looking for is among its columns.

col_name = 'whatever_col_i_need'
table_matches = [table_name for table_name in db_table_names
                 if col_name in pandas.read_sql(f"SELECT * FROM {table_name}",
                                                con=main_conn).columns]

Problem: I think this sucks up a lot of resources. It may be 'holding on' to each table as it iterates, or it is just generally slow. I thought about using a with statement so it would connect to the table, look for the column, then disconnect from the table, but I'm not sure that's the best way, and I don't think I'm using the right format.

found_tables = []

for table_name in db_table_names:
    # The intent: load the table, check its columns, then release it.
    # As written this doesn't actually work, since a DataFrame is not a
    # context manager, so the with statement raises an AttributeError.
    with pandas.read_sql(f"SELECT * FROM {table_name}", con=main_conn) as temp_tbl:
        if 'LOCATION' in temp_tbl.columns:
            found_tables.append(table_name)

UPDATE: I figured out a method that is super fast and meets my needs perfectly.

Since I don't really care what information is in the table and just want to confirm that the column name exists, there's no need to store the table data, even temporarily. By wrapping a query that selects only that column in a try/except, if the column doesn't exist in the table then the query fails, execution hits the except, and the loop moves on to the next table.

found_tables = []

for table_name in db_table_names:
    try:
        # Print statement just to follow the progress
        print(f'Checking: {table_name} for {col_name}...')
        pandas.read_sql_query(f"SELECT {col_name} FROM {table_name}", con=main_conn)
        found_tables.append(table_name)
    except Exception as e:
        # The query fails when the column isn't there; log it and move on
        print(e)
  • If you have hundreds of similar tables, this is a poor database design. Dynamic information should be in table data, not table and column names. – Barmar Apr 09 '23 at 16:36
  • @markalex The problem is not getting the column names, the problem is how to iterate through tables in a database without taking forever or sucking up resources. – spareTimeCoder Apr 09 '23 at 16:40
  • The reason your code is so slow is because you're using `SELECT *`. That's a very expensive query to execute for each table. If you just used a simple `INFORMATION_SCHEMA` query to get the column names you wouldn't have to fetch enormous amounts of data like this. – Barmar Apr 09 '23 at 16:57
  • You can do it in a single query: `SELECT TABLE_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME IN ('table1', 'table2', ...) AND COLUMN_NAME = 'LOCATION'` – Barmar Apr 09 '23 at 16:59
  • @Barmar I'll try that last suggestion of yours and see if it speeds things up. Thanks! – spareTimeCoder Apr 09 '23 at 17:23
  • @Barmar That might work in SQL, but I don't think it will in Python – spareTimeCoder Apr 09 '23 at 17:36
  • Definitely not a duplicate. – NL23codes Apr 09 '23 at 18:28
  • 1
    @spareTimeCoder Why won't it work in Python? It sends the query to SQL. – Barmar Apr 09 '23 at 20:11
  • @Barmar The information_schema part doesn't work. I figured out a solution anyway. Thanks though. – spareTimeCoder Apr 10 '23 at 11:12
  • I'm not sure why it doesn't work, but I've reopened so you can post your solution as an answer. – Barmar Apr 10 '23 at 14:56
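
For reference, a minimal sketch (untested) of the single-query `INFORMATION_SCHEMA` approach Barmar describes in the comments above, reusing main_conn and col_name from the question; pandas may warn that it prefers SQLAlchemy connections, but the `?` placeholder is the standard pyodbc parameter style:

# One query returns every table containing the column,
# without fetching any table data at all
query = ("SELECT TABLE_NAME "
         "FROM INFORMATION_SCHEMA.COLUMNS "
         "WHERE COLUMN_NAME = ?")

matches = pandas.read_sql_query(query, con=main_conn, params=[col_name])
found_tables = matches['TABLE_NAME'].tolist()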

1 Answer


The key really is not saving the query results to a variable, especially when you don't know how large the response could be. Secondly, the try/except was an easy way of handling whether or not the table had the column, since it could be assumed that the query would fail if the column wasn't there, so simply moving on did the trick. It may seem inelegant, but it iterated through hundreds of million-plus-row tables in seconds and returned the names of the found tables.

import pandas
import pyodbc

main_path = 'my_database_name'

with pyodbc.connect("DRIVER={SQL Server};"
                    "SERVER=MyServerName;"
                    f"DATABASE={main_path};"
                    "Trusted_Connection=yes;") as main_conn:

    # Indented inside the with block; pyodbc commits (but doesn't close)
    # the connection on exit, so main_conn stays usable below
    db_table_names = [row.table_name for row in main_conn.cursor().tables()]
found_tables = []
col_name = 'SomeCol'

for table_name in db_table_names:
    try:
        # Print statement to just follow the progress
        print(f'Checking: {table_name} for {col_name}...')
        pandas.read_sql_query(f"SELECT {col_name} FROM {table_name}", con=main_conn)
        found_tables.append(table_name)
    except Exception as e:
        # Print statement can be omitted if there's just too much printing out
        print(e)
        pass

print(f"Found Column name in: {found_tables}")