I'm using the impyla package's SQLAlchemy support to run a series of queries with Hive on Spark from Python. SQLAlchemy automatically creates and closes a DBAPI cursor for each SQL statement that is executed. Because impyla's HiveServer2Cursor implementation closes the underlying Hive session when the cursor is closed, each SQL statement ends up running as a separate Spark job. I would like to avoid the overhead of starting up a new Spark job for every SQL statement AND still use SQLAlchemy instead of the raw DBAPI interface.
It certainly does work to reuse a single DBAPI cursor, but again I would prefer to use the SQLAlchemy engine with its connection pooling and automatic cursor management.
# this version uses raw dbapi and only one cursor and therefore one hive session
from impala.dbapi import connect

con = connect(host='cdh-dn8.ec2.internal', port=10000, kerberos_service_name='hive', auth_mechanism='GSSAPI')
cur = con.cursor()
cur.execute('set hive.execution.engine=spark')
cur.execute("select * from reference.zipcode where zip = '55112'")
rows = cur.fetchall()
# use data from result and execute more queries ...
cur.close()
con.close()
# this version uses sqlalchemy and one cursor per statement executed, resulting in multiple hive sessions
from sqlalchemy import create_engine

sqlalchemyengine = create_engine('impala://cdh-dn8.ec2.internal:10000', kerberos_service_name='hive', auth_mechanism='GSSAPI')
conn = sqlalchemyengine.connect()
conn.execute('set hive.execution.engine=spark')
result = conn.execute("select * from reference.zipcode where zip = '55112'")
# use data from result and execute more queries ...
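One workaround I've sketched (untested against impyla's dialect, and it gives up the automatic cursor management I'd like to keep) is to check the underlying DBAPI connection out of the SQLAlchemy pool with engine.raw_connection() and reuse a single cursor on it, passing the Kerberos options through connect_args, so that every statement shares one Hive session:

# sketch only: borrow the pooled DBAPI connection and reuse one cursor
from sqlalchemy import create_engine

engine = create_engine('impala://cdh-dn8.ec2.internal:10000',
                       connect_args={'kerberos_service_name': 'hive', 'auth_mechanism': 'GSSAPI'})
raw_con = engine.raw_connection()  # DBAPI connection checked out of the pool
try:
    cur = raw_con.cursor()         # one cursor, so one Hive session
    cur.execute('set hive.execution.engine=spark')
    cur.execute("select * from reference.zipcode where zip = '55112'")
    rows = cur.fetchall()
    # ... run more statements on the same cursor ...
    cur.close()
finally:
    raw_con.close()                # returns the connection to the pool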
I'm wondering if there is a good reason for impyla to open and close the Hive session with every cursor rather than closing the Hive session when the connection is closed.