I frequently pull data from different databases into pandas dataframes, do some processing, and eventually hit a step that I would strongly prefer to write in SQL (often involving joins across DFs). I figured I'd write a quick function which would:
- accept an arbitrary sql query and set of relevant dataframes
- write the dataframes to a tmp sqlite3 db
- execute the SQL query in sqlite3 and return results as a dataframe
The function I came up with looks like this:
import sqlite3
import pandas as pd

def SQL_on_DFs(query: str, dfs=[]) -> pd.DataFrame:
    """convenience function to write sql on arbitrary DFs
    ...execute in a temp sqlite db and return results as DF
    """
    # validation - every item in dfs list must be a dataframe
    assert all(isinstance(x, pd.DataFrame) for x in dfs)
    # if no temp.db exists here, sqlite3 will create one
    sql3conn = sqlite3.connect("temp.db")
    # write dfs to sqlite3 db to perform sql query on them there
    for df in dfs:
        df.to_sql(sql3conn)
    result = pd.read_sql_query(query, sql3conn)
    return result
# TEST
df_ab = pd.DataFrame({"a": [1,2,3], "b": [2,2,2]})
df_ac = pd.DataFrame({"a": [1,2,3], "c": [3,3,3]})
df_abc = pd.DataFrame({"a": [1,2,3], "b": [2,2,2], "c": [3,3,3]})
q = "select ab.*, ac.c from df_ab ab left join df_ac ac on ab.a = ac.a"
assert SQL_on_DFs(q, [df_ab, df_ac]).equals(df_abc)
Super useful, super easy. Right?! Wrong! Don't you feel like a sucker. Definitely you and not me.
df.to_sql(sql3conn) throws an error. It says it's missing 1 required positional argument: 'con', but we've passed a con. What it's actually missing is the other required positional parameter, name: the name of the table to write to sqlite3 as. So I need a way to get that the "name" of df_ab should be "df_ab", and the "name" of df_ac should be "df_ac".
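To see why the error message blames con, here's a minimal sketch of what a single table write actually needs (the in-memory db and the hard-coded table name "df_ab" are just for illustration):

```python
import sqlite3
import pandas as pd

df_ab = pd.DataFrame({"a": [1, 2, 3], "b": [2, 2, 2]})

# an in-memory db keeps the sketch self-contained
conn = sqlite3.connect(":memory:")

# to_sql's signature is to_sql(name, con, ...): the table name comes first,
# so a lone positional connection arg gets bound to name and con goes missing
df_ab.to_sql("df_ab", conn, index=False)

result = pd.read_sql_query("select * from df_ab", conn)
conn.close()
```

With the name supplied explicitly, the write and read-back work fine; the whole problem is that the name has to come from somewhere.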
Now I know what you're thinking: Max Power, why don't you just change your second function parameter from dfs = [df_ab, df_ac] to [(df_ab, "df_ab"), (df_ac, "df_ac")]? Please don't make me do that. This is supposed to be a convenience function and that looks inconvenient.
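For the record, the tuple-passing variant I'm trying to avoid would look something like this (a sketch only; the name SQL_on_named_DFs is made up for illustration):

```python
import sqlite3
import pandas as pd

def SQL_on_named_DFs(query: str, named_dfs) -> pd.DataFrame:
    """the inconvenient variant: caller supplies (df, name) pairs"""
    conn = sqlite3.connect(":memory:")
    # with the name passed alongside each df, to_sql gets both its
    # required positional arguments and the writes succeed
    for df, name in named_dfs:
        df.to_sql(name, conn, index=False)
    result = pd.read_sql_query(query, conn)
    conn.close()
    return result

df_ab = pd.DataFrame({"a": [1, 2, 3], "b": [2, 2, 2]})
df_ac = pd.DataFrame({"a": [1, 2, 3], "c": [3, 3, 3]})
q = "select ab.*, ac.c from df_ab ab left join df_ac ac on ab.a = ac.a"
result = SQL_on_named_DFs(q, [(df_ab, "df_ab"), (df_ac, "df_ac")])
```

It works, but every call site now has to repeat each dataframe's variable name as a string, which is exactly the ceremony a convenience function should remove.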