
I am using DSN to connect my local python to an HDFS cluster:

import pandas as pd
import pyodbc

with pyodbc.connect("DSN=CDH_HIVE_PROD", autocommit=True) as conn:
    df = pd.read_sql("SELECT * FROM table1", conn)
df

How do I write this table back to the cluster as 'table1tmp'? Do I need a CREATE statement to create the table first? And then how do I insert the data from a pandas dataframe?

I assume this is done frequently enough that it should be fairly easy (pull data, do something, save data back), but I am not able to find any examples that use pyodbc or a DSN, which seems to be my only way of connecting.
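For reference, the only plain-pyodbc route I can think of is an explicit CREATE TABLE followed by parameterized INSERTs, roughly like the sketch below. The column names and types are just placeholders, and I'm not sure the Hive ODBC driver even supports this pattern:

import pandas as pd
import pyodbc

with pyodbc.connect("DSN=CDH_HIVE_PROD", autocommit=True) as conn:
    cursor = conn.cursor()
    # Placeholder columns/types -- these would have to match table1
    cursor.execute("CREATE TABLE IF NOT EXISTS table1tmp (col1 INT, col2 STRING)")
    # Insert the dataframe row by row via parameterized INSERTs
    cursor.executemany(
        "INSERT INTO table1tmp VALUES (?, ?)",
        list(df.itertuples(index=False, name=None)),
    )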

user3486773
  • [This question](https://stackoverflow.com/q/42210901/2144390) (and its answers) suggest that [PyHive](https://github.com/dropbox/PyHive#sqlalchemy) might be an option. – Gord Thompson Dec 22 '20 at 19:59
    TL;DR - To use pandas `read_sql_query()` *et al.* you can get away with using a plain DBAPI connection but to use `to_sql()` to write back to the database you need to use a SQLAlchemy `Connectable` (`Engine` or `Connection`). – Gord Thompson Dec 22 '20 at 20:33

1 Answer


We had a similar problem. Thanks to the comment about SQLAlchemy, we came to a workable solution:

import pandas as pd
import sqlalchemy


def my_create_engine(mydsn, mydatabase, **kwargs):
    # Build a SQLAlchemy engine on top of the existing ODBC DSN
    connection_string = 'mssql+pyodbc://@%s' % mydsn
    cargs = {'database': mydatabase}
    cargs.update(**kwargs)
    return sqlalchemy.create_engine(connection_string, connect_args=cargs)


engine = my_create_engine('CDH_HIVE_PROD', 'CDH_HIVE_PROD')

# to_sql() needs a SQLAlchemy Connectable (Engine/Connection), not a raw pyodbc connection
df_output.to_sql(name='test', schema='dbo', con=engine, if_exists='replace', index=False)
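If it helps, you can verify the write by reading the new table back through the same engine (the table and schema names here just follow the example above):

# Read the freshly written table back to confirm it exists and has the expected rows
df_check = pd.read_sql("SELECT * FROM dbo.test", engine)
print(df_check.head())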
Nina van Bruggen