13

I am trying to use a temp table with SQLAlchemy and join it against an existing table. This is what I have so far:

from datetime import date

import pandas as pd

# db is the Flask-SQLAlchemy instance; ExistingTable is a mapped model
engine = db.get_engine(db.app, 'MY_DATABASE')
df = pd.DataFrame({"id": [1, 2, 3], "value": [100, 200, 300], "date": [date.today(), date.today(), date.today()]})
temp_table = db.Table('#temp_table',
                      db.Column('id', db.Integer),
                      db.Column('value', db.Integer),
                      db.Column('date', db.DateTime))
temp_table.create(engine)
df.to_sql(name='tempdb.dbo.#temp_table',
          con=engine,
          if_exists='append',
          index=False)
query = db.session.query(ExistingTable.id).join(temp_table, temp_table.c.id == ExistingTable.id)
out_df = pd.read_sql(query.statement, engine)
temp_table.drop(engine)
return out_df.to_dict('records')

This doesn't return any results because the `INSERT` statements that `to_sql` issues never take effect (I think this is because they are run using `sp_prepexec`, but I'm not entirely sure about that).
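
One thing I haven't tried yet: pinning every step to a single connection. The engine may check out a different pooled connection for each call, and a local `#temp` table only lives as long as the session that created it. A rough, untested sketch of what I mean, reusing the names from above:

with engine.connect() as conn:
    # create, fill, query, and drop the temp table on ONE connection/session
    temp_table.create(conn)
    df.to_sql(name='#temp_table', con=conn, if_exists='append', index=False)
    out_df = pd.read_sql(query.statement, conn)
    temp_table.drop(conn)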

I then tried just writing out the SQL statements (`CREATE TABLE #temp_table...`, `INSERT INTO #temp_table...`, `SELECT [id] FROM...`) and then running `pd.read_sql(query, engine)`. I get the error message

This result object does not return rows. It has been closed automatically.

I guess this is because the statement does more than just `SELECT`?
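
If the raw-SQL route is the way to go, I assume the fix is to keep the setup statements separate so that `read_sql` only ever sees the `SELECT`. A rough, untested sketch (the statement text and the table name existing_table are illustrative):

import sqlalchemy as sa

with engine.connect() as conn:
    # run the setup statements first, on the same connection...
    conn.execute(sa.text(
        "CREATE TABLE #temp_table (id INT, value INT, date DATETIME)"))
    conn.execute(
        sa.text("INSERT INTO #temp_table (id, value, date) "
                "VALUES (:id, :value, :date)"),
        [{"id": 1, "value": 100, "date": date.today()},
         {"id": 2, "value": 200, "date": date.today()}],
    )
    # ...then hand read_sql nothing but the SELECT
    out_df = pd.read_sql(
        "SELECT e.id FROM existing_table AS e "
        "JOIN #temp_table AS t ON t.id = e.id",
        conn,
    )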

How can I fix this issue? Either solution would work, although the first would be preferable as it avoids hard-coded SQL. To be clear, I can't modify the schema in the existing database; it's a vendor database.

Kris Harper
  • are there any records in `ExistingTable`? – Azat Ibrakov May 23 '17 at 17:20
  • @AzatIbrakov Yeah. I actually changed it to a left join and added `temp_table.c.date` just to be sure. I get rows back with `None` in the `date` column. – Kris Harper May 23 '17 at 17:26
  • why does your `date` column have type `DateTime` instead of `Date`? – Azat Ibrakov May 23 '17 at 18:01
  • @AzatIbrakov It's just some test code. The table has a datetime. I don't think that should have any effect on the output though, do you? – Kris Harper May 23 '17 at 18:03
  • I tested it on `sqlite` and it causes problems – Azat Ibrakov May 23 '17 at 18:06
  • @AzatIbrakov Ah. Feel free to change it to `Date`. Or whatever else works. The value doesn't actually matter. Not sure how SQLite handles temp tables though. – Kris Harper May 23 '17 at 18:15
  • "I think this is because they are run using sp_prepexec" This is plausible. A temp table created in a stored procedure, including a call to sp_executesql, sp_prepexec, etc isnot the same as a temp table created in a top level batch, and will be destroyed automatically at the end of the call. – David Browne - Microsoft May 30 '17 at 21:53
  • @KrisHarper: Why do you need to join on this temp table? Where does the data for this table come from? How many entries are usually there? – van May 30 '17 at 23:19
  • @van The data come from another database on another server. I need to incorporate that data into my query (specifically I need to use it in a `SUM` expression). It's not practical to pull all the data from `ExistingTable` and join in code because there may be several million records (but after aggregating those will be reduced to just a few records). – Kris Harper May 31 '17 at 13:10
  • Check out this [recent post](https://stackoverflow.com/q/44270442/1422451) from an OP who had a similar pandas `to_sql` and sqlalchemy issue (for Postgres rather than SQL Server), where `to_sql` produced no error but also no resulting table from the dataframe push. The OP even answered their own question. – Parfait May 31 '17 at 15:29
  • @KrisHarper: usually, how many records from that other database are used: several, dozens, hundreds, or more? – van May 31 '17 at 16:09
  • @van About 200. Although only a handful of those will match on the join condition. – Kris Harper May 31 '17 at 19:10

2 Answers

24

If the number of records to be inserted into the temporary table is small to moderate, one possibility is to use a literal subquery or a values CTE instead of creating a temporary table.

import sqlalchemy as sa
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

# MODEL
class ExistingTable(Base):
    __tablename__ = 'existing_table'
    id = sa.Column(sa.Integer, primary_key=True)
    name = sa.Column(sa.String)
    # ...

Assume also that the following data is to be inserted into the temp table:

import datetime

# This data is retrieved from another database and used for filtering
rows = [
    (1, 100, datetime.date(2017, 1, 1)),
    (3, 300, datetime.date(2017, 3, 1)),
    (5, 500, datetime.date(2017, 5, 1)),
]

Create a CTE or a sub-query containing that data:

stmts = [
    # @NOTE: optimization to reduce the size of the statement:
    # cast types only in the first SELECT; for the other rows the
    # DB engine will infer them
    sa.select([
        sa.cast(sa.literal(i), sa.Integer).label("id"),
        sa.cast(sa.literal(v), sa.Integer).label("value"),
        sa.cast(sa.literal(d), sa.DateTime).label("date"),
    ]) if idx == 0 else
    sa.select([sa.literal(i), sa.literal(v), sa.literal(d)])  # no type cast

    for idx, (i, v, d) in enumerate(rows)
]
subquery = sa.union_all(*stmts)

# Choose one option below.
# I personally prefer B because one could reuse the CTE multiple times in the same query
# subquery = subquery.alias("temp_table")  # option A
subquery = subquery.cte(name="temp_table")  # option B

Create final query with the required joins and filters:

query = (
    session
    .query(ExistingTable.id)
    .join(subquery, subquery.c.id == ExistingTable.id)
    # .filter(subquery.c.date >= XXX_DATE)
)

# TEMP: test result output
for res in query:
    print(res)

Finally, get the pandas data frame:

out_df = pd.read_sql(query.statement, engine)
result = out_df.to_dict('records')
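
Note: on SQLAlchemy 1.4+, the same inline data can be expressed without the `UNION ALL` chain using the `values()` construct. An untested sketch, reusing the names from above:

temp_values = sa.values(
    sa.column("id", sa.Integer),
    sa.column("value", sa.Integer),
    sa.column("date", sa.DateTime),
    name="temp_table",
).data(rows)

query = (
    session
    .query(ExistingTable.id)
    .join(temp_values, temp_values.c.id == ExistingTable.id)
)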
van
  • Ha, I was *just* thinking about doing this on my commute in this morning. I'll give it a shot and let you know. – Kris Harper Jun 01 '17 at 12:43
  • and performance-wise, how good is this solution? `UNION ALL` seems really heavy to me... Need to test it, but it would be awesome if there were some thoughts about performance... – Alexander B. Apr 02 '20 at 04:11
  • Why use UNION ALL? –  May 18 '22 at 21:06
  • `UNION ALL` instead of just `UNION` avoids having the database engine eliminate duplicates, since one could already do that in this code. See https://stackoverflow.com/a/49928/99594 for more information. I do not think it is _heavy_, as the solution is explicitly meant for a not overly large number of such records. – van May 19 '22 at 03:02
  • @van What would you consider to be a "small/moderate" number of records? I am interested in either a temp table (and COPY FROM using csv) or CTE to enable a complex insert of ~2,500 rows. INSERT INTO VALUES is painfully slow for this. – freethebees Jan 03 '23 at 16:08
  • @freethebees, the best way for you is to profile the solution and see if it works. Also, since the answer was posted, sqalalchemy should have a more optimized way to insert multiple data with other constructs. – van Jan 05 '23 at 06:20
1

You can try another solution: a process-keyed table (a rough sketch follows the quoted description below).

A process-keyed table is simply a permanent table that serves as a temp table. To permit processes to use the table simultaneously, the table has an extra column to identify the process. The simplest way to do this is to use the global variable `@@spid` (`@@spid` is the process id in SQL Server).

...

One alternative for the process-key is to use a GUID (data type uniqueidentifier).

http://www.sommarskog.se/share_data.html#prockeyed
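
A minimal, untested sketch of the pattern, assuming a database where you do have rights to create tables; the table and column names here are hypothetical, and `engine` and `ExistingTable` are reused from the question:

import datetime
import uuid

import sqlalchemy as sa

metadata = sa.MetaData()

# Hypothetical permanent table acting as a shared "temp" table; the
# process_key column tells concurrent processes' rows apart.
process_keyed = sa.Table(
    'process_keyed', metadata,
    sa.Column('process_key', sa.String(36), index=True),
    sa.Column('id', sa.Integer),
    sa.Column('value', sa.Integer),
    sa.Column('date', sa.DateTime),
)

key = str(uuid.uuid4())  # a GUID instead of @@spid
with engine.begin() as conn:
    conn.execute(process_keyed.insert(), [
        {'process_key': key, 'id': 1, 'value': 100, 'date': datetime.date(2017, 1, 1)},
        {'process_key': key, 'id': 3, 'value': 300, 'date': datetime.date(2017, 3, 1)},
    ])
    result = conn.execute(
        sa.select([ExistingTable.id])
        .select_from(ExistingTable.__table__.join(
            process_keyed, process_keyed.c.id == ExistingTable.id))
        .where(process_keyed.c.process_key == key)
    ).fetchall()
    # always clean up this process's rows when done
    conn.execute(process_keyed.delete().where(
        process_keyed.c.process_key == key))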

Mikhail Lobanov
  • Are you suggesting creating this table in tempdb? I think either way, it will run into the same issue that I am experiencing in the second part of my question where `read_sql` doesn't return any rows. – Kris Harper May 31 '17 at 14:41
  • This table should be created in your db (`MY_DATABASE`), not in tempdb. It's not pretty, but it should work. – Mikhail Lobanov May 31 '17 at 14:44
  • I don't have access to create tables in that database. It's a vendor database. – Kris Harper May 31 '17 at 14:46