I'm using the impyla package's SQLAlchemy support to run a series of queries with Hive on Spark from Python. SQLAlchemy automatically creates and closes a DBAPI cursor for each SQL statement that is executed. Because impyla's HiveServer2Cursor implementation closes the underlying Hive session when the cursor is closed, each SQL statement ends up running as a separate Spark job. I would like to avoid the overhead of starting up a new Spark job for every SQL statement AND still use SQLAlchemy instead of the raw DBAPI interface.
It certainly does work to reuse a single DBAPI cursor, but again I would prefer to use the SQLAlchemy engine with its connection pooling and automatic cursor management.
# this version uses raw dbapi and only one cursor and therefore one hive session
from impala.dbapi import connect

con = connect(host='cdh-dn8.ec2.internal', port=10000, kerberos_service_name='hive', auth_mechanism='GSSAPI')
cur = con.cursor()
cur.execute('set hive.execution.engine=spark')
cur.execute("select * from reference.zipcode where zip = '55112'")
rows = cur.fetchall()
# use data from result and execute more queries ...
cur.close()
con.close()
# this version uses sqlalchemy and one cursor per statement executed, resulting in multiple hive sessions
from sqlalchemy import create_engine

sqlalchemyengine = create_engine('impala://cdh-dn8.ec2.internal:10000', kerberos_service_name='hive', auth_mechanism='GSSAPI')
conn = sqlalchemyengine.connect()
conn.execute('set hive.execution.engine=spark')
result = conn.execute("select * from reference.zipcode where zip = '55112'")
# use data from result and execute more queries ...
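One workaround I've sketched (untested against impyla's dialect, and it gives up the automatic cursor management I'd like to keep) is to check the underlying DBAPI connection out of the SQLAlchemy pool with engine.raw_connection() and reuse a single cursor on it, passing the Kerberos options through connect_args, so that every statement shares one Hive session:

# sketch only: borrow the pooled DBAPI connection and reuse one cursor
from sqlalchemy import create_engine

engine = create_engine('impala://cdh-dn8.ec2.internal:10000',
                       connect_args={'kerberos_service_name': 'hive', 'auth_mechanism': 'GSSAPI'})
raw_con = engine.raw_connection()  # DBAPI connection checked out of the pool
try:
    cur = raw_con.cursor()         # one cursor, so one Hive session
    cur.execute('set hive.execution.engine=spark')
    cur.execute("select * from reference.zipcode where zip = '55112'")
    rows = cur.fetchall()
    # ... run more statements on the same cursor ...
    cur.close()
finally:
    raw_con.close()                # returns the connection to the pool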
I'm wondering if there is a good reason for impyla to open and close the Hive session with every cursor rather than closing the Hive session when the connection is closed.