
I frequently pull data from different databases into pandas dataframes, do some processing, and then eventually hit a processing step that I would strongly prefer to write in SQL (often involving joins across DFs). I figured I'd write a quick function which would:

  • accept an arbitrary sql query and set of relevant dataframes
  • write the dataframes to a tmp sqlite3 db
  • execute the SQL query in sqlite3 and return results as a dataframe

The function I came up with looks like this:

import pandas as pd
import sqlite3


def SQL_on_DFs(query: str, dfs=[]) -> pd.DataFrame:
    """Convenience function to run SQL on arbitrary DFs:
    execute in a temp sqlite db and return results as a DF.
    """
    # validation - every item in dfs must be a DataFrame
    assert all(isinstance(x, pd.DataFrame) for x in dfs)

    # if no temp.db exists here, sqlite3 will create one
    sql3conn = sqlite3.connect("temp.db")

    # write dfs to the sqlite3 db so the query can run against them
    for df in dfs:
        df.to_sql(sql3conn)

    result = pd.read_sql_query(query, sql3conn)
    return result

# TEST
df_ab = pd.DataFrame({"a": [1,2,3], "b": [2,2,2]})
df_ac = pd.DataFrame({"a": [1,2,3], "c": [3,3,3]})
df_abc = pd.DataFrame({"a": [1,2,3], "b": [2,2,2], "c": [3,3,3]})

q = "select ab.*, ac.c from df_ab ab left join df_ac ac on ab.a = ac.a"
assert SQL_on_DFs(q, [df_ab, df_ac]).equals(df_abc)

Super useful, super easy. Right?! Wrong! Don't you feel like a sucker. Definitely you and not me.

`df.to_sql(sql3conn)` throws an error. It says it's missing 1 required positional argument: `'con'`, but we *have* passed a con. What's actually missing is the other required positional parameter: the name of the table to write to in sqlite3. So I need a way to determine that the "name" of `df_ab` should be `"df_ab"`, and the "name" of `df_ac` should be `"df_ac"`.
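(For reference, `DataFrame.to_sql` takes the table name as its *first* argument, before the connection. A minimal sketch of the call that works; the in-memory database here is just for illustration:)

```python
import sqlite3

import pandas as pd

df_ab = pd.DataFrame({"a": [1, 2, 3], "b": [2, 2, 2]})

# to_sql's signature is to_sql(name, con, ...) - the table name comes first
conn = sqlite3.connect(":memory:")
df_ab.to_sql("df_ab", conn, index=False, if_exists="replace")

# the dataframe is now queryable in sqlite under that table name
print(pd.read_sql_query("select count(*) as n from df_ab", conn))
```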

Now I know what you're thinking: Max Power, why don't you just change your second function parameter from `dfs=[df_ab, df_ac]` to `[(df_ab, "df_ab"), (df_ac, "df_ac")]`?

Please don't make me do that. This is supposed to be a convenience function, and that looks inconvenient.

Max Power
  • Max Power, why don't you just change your second function parameter from `dfs = [df_ab, df_ac]`, to `[(df_ab, "df_ab"), (df_ac, "df_ac")]`? –  Feb 04 '22 at 20:48
    you're killing me, smalls – Max Power Feb 04 '22 at 20:52
  • I'm not sure I understand your idea of *convenience function* here. The names `df_ab` and `df_ac` are hardcoded in your query by the users, which makes sense or nothing happens. But if the users have to do this, I'm not sure why the idea of @richardec is that inconvenient. I would rather use a dictionary to store the dataframes, `dfs = {'df_ab':pd.DataFrame({"a": [1,1,1], "b": [2,2,2]}), 'df_ac':pd.....}`, then slightly adapt your function. But either way, the users should specify the name somewhere to match the table names used in the query. – Ben.T Feb 04 '22 at 21:24
  • Note that there are possibilities to [get the name of a variable](https://stackoverflow.com/a/65756509/9274732), but it does not look worth it in this case – Ben.T Feb 04 '22 at 21:24
    Hi Ben, thanks for the thoughtful response. And that `python-varname` package actually looks promising to me, I'll check it out. As far as my idea of convenience, often I have like df1 df2, and df3 sitting around and then realize I want to write some sql like ("select df1.*, df2.somevar..."). And I'd like to be able to execute this query with as little extra work and extraneous code as possible. constructing that extra dict feels like extra work and takes up extra lines of code or will make a single-line function call extra long. – Max Power Feb 04 '22 at 21:52
    alright so update on the `python-varname` `nameof()` approach - doesn't seem to work (at least how I tried) because `nameof(df)` under `for df in dfs` is just `"df"`, not the df's name in the function call. – Max Power Feb 04 '22 at 22:42
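
For completeness, here is a rough sketch of the dict-keyed variant Ben.T describes above; the function name and the in-memory database are my choices, not from the thread:

```python
import sqlite3

import pandas as pd


def sql_on_named_dfs(query: str, dfs: dict) -> pd.DataFrame:
    """Run `query` against the DataFrames in `dfs`, keyed by table name."""
    conn = sqlite3.connect(":memory:")
    for name, df in dfs.items():
        # the dict key supplies the table name to_sql requires
        df.to_sql(name, conn, index=False, if_exists="replace")
    return pd.read_sql_query(query, conn)


dfs = {
    "df_ab": pd.DataFrame({"a": [1, 2, 3], "b": [2, 2, 2]}),
    "df_ac": pd.DataFrame({"a": [1, 2, 3], "c": [3, 3, 3]}),
}
q = "select ab.*, ac.c from df_ab ab left join df_ac ac on ab.a = ac.a"
result = sql_on_named_dfs(q, dfs)
```

It works, but it still forces the caller to restate each dataframe's name, which is exactly the inconvenience the question wants to avoid.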

1 Answer


Ok I have an answer, and this method doesn't even require the DFs as a param at all. It parses their names from the SQL statement and then grabs the corresponding objects. This answer satisfies the following:

  • have: some Pandas DataFrames lying around
  • want: a function that accepts a SQL query referencing them as if they were SQL tables, executes the SQL query and returns the result as a DataFrame

...

Given this setup (imports and helper function)...

import pandas as pd 
import sqlite3
from sql_metadata import Parser
from inspect import currentframe


def get_dfs_from_caller_scope(df_names: list[str]) -> dict[str, pd.DataFrame]:
    """Given a list of DF names, return a dict mapping df-name to df from the original* scope.

    *'original scope' in this function's context meaning the caller's caller's scope
    """
    vars_in_orig_scope = currentframe().f_back.f_back.f_locals

    dfs_map = {}
    for k, v in vars_in_orig_scope.items():
        if k in df_names and isinstance(v, pd.DataFrame):
            dfs_map[k] = v

    return dfs_map
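
The frame trick above hinges on `currentframe().f_back` giving you the calling frame, whose `f_locals` holds the caller's variables by name. A tiny self-contained illustration of just that mechanism (the names here are toy examples, not part of the answer):

```python
from inspect import currentframe


def grab_from_caller(name):
    """Return the variable called `name` from the caller's local scope."""
    # f_back is the frame that called us; f_locals is its variables by name
    return currentframe().f_back.f_locals.get(name)


def demo():
    secret = 42  # local to demo(), never passed to grab_from_caller
    return grab_from_caller("secret")
```

The helper in the answer uses `f_back.f_back` because there are two frames between it and the user's code: the helper itself and `SQL_on_DFs`.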

This function now does what I want:

def SQL_on_DFs(query: str) -> pd.DataFrame:
    """Convenience function to run SQL on arbitrary DFs:
    execute in a temp sqlite db and return results as a DF.
    """
    # get table-names referenced by query, which should correspond to dataframes in caller's scope
    tablenames = Parser(query).tables

    # get a mapping of those tablenames in query to the actual corresponding DataFrame objects from caller's scope 
    tablenames_and_assoc_dfs = get_dfs_from_caller_scope(tablenames)

    # if no temp.db exists here, sqlite3 will create one
    sql3conn = sqlite3.connect("temp.db")
    
    # write dfs to sqlite3 db to perform sql query on them there
    for df_name, df in tablenames_and_assoc_dfs.items():
        df.to_sql(df_name, sql3conn, index=False, if_exists='replace')
    
    result = pd.read_sql_query(query, sql3conn)
    return result

TEST:

df_ab = pd.DataFrame({"a": [1,2,3], "b": [2,2,2]})
df_ac = pd.DataFrame({"a": [1,2,3], "c": [3,3,3]})
df_abc = pd.DataFrame({"a": [1,2,3], "b": [2,2,2], "c": [3,3,3]})

q = "select ab.*, ac.c from df_ab ab left join df_ac ac on ab.a = ac.a"
assert SQL_on_DFs(q).equals(df_abc)  # Passes
Max Power
  • Not sure what I think about this answer. I kinda really like it, but also think it might be terrible? Will leave this open for a week to see what anyone else comes up with before considering accepting mine. – Max Power Feb 05 '22 at 04:27
    Won't lie, I don't really like the idea of getting the variable names, but I have to admit that you tackled your problem in a clean and pretty solid way :) and you made me discover a new library (`sql_metadata`) that I need. Good job! – Ben.T Feb 05 '22 at 13:09