
I have a pandas DataFrame in Python and want to write this DataFrame directly into a Netezza database.

I would like to use the pandas.to_sql() method that is described here, but it seems that this method needs SQLAlchemy to connect to the database. The problem: SQLAlchemy does not support Netezza. What I am using at the moment to connect to the database is pyodbc. But pyodbc, on the other hand, is not understood by pandas.to_sql(), or am I wrong about this?

My workaround is to write the DataFrame into a csv file via pandas.to_csv() and send it to the Netezza database via pyodbc.

Since I am working with big data, writing the csv first is a performance issue. I actually do not care whether I use SQLAlchemy or pyodbc or something else, but I cannot change the fact that I have a Netezza database.

I am aware of deontologician's project, but as the author states himself, it "is far from complete, has a lot of bugs". I got the package to work (see my solution below). But if someone knows a better solution, please let me know!

I figured it out. For my solution, see the accepted answer.

NemesisMF
  • Did you try using the dialect to see if it works in your particular case? When I took over the [sqlalchemy-access](https://github.com/sqlalchemy/sqlalchemy-access) dialect the first thing I did was to get pandas `to_sql` working. – Gord Thompson Sep 17 '19 at 15:19
  • Check the list returned by `pyodbc.drivers()` to see what drivers are available to your Python app. Do you see the Netezza driver in that list? – Gord Thompson Sep 18 '19 at 11:56
  • Yes. "NetezzaSQL" is in the list. I got rid of the `pyodbc.InterfaceError`. The error gave me the hint to add a data source in the Windows ODBC Data Source Administrator tool and use that instead of trying to "manually" type in the server address. Then it finds the driver. I don't know if this is normal behavior, since in pyodbc it was not necessary to do that. So instead of `engine = create_engine("netezza://usr:pass@address:port/database_name")` I use `engine = create_engine("netezza://ODBCDataSourceName")`. It still does not work yet. I will update my question soon. – NemesisMF Sep 19 '19 at 09:00
  • Thanks for the effort, Gord! I found a solution. Edited my question to include it. – NemesisMF Sep 19 '19 at 14:02
  • Thanks for taking the time to follow up. It's fairly common for people here to post their solution as an answer to their own question. That makes it easier for others to benefit, and if you accept your own answer it serves to mark the issue as "resolved". – Gord Thompson Sep 19 '19 at 16:36
  • Thanks for the advice! Edited the question and put the solution in an accepted answer. I am fairly new here, so I appreciate the help with the protocol. – NemesisMF Sep 19 '19 at 16:47

2 Answers


Solution

I found a solution that I want to share for everyone with the same problem. I tried the netezza dialect from deontologician, but it does not work with Python 3, so I made a fork and corrected some encoding issues. I uploaded it to GitHub and it is available here. Be aware that I only made some small changes, that it is mostly the work of deontologician, and that nobody is maintaining it.

Having the netezza dialect, I got pandas.to_sql() to work directly with the Netezza database:

import netezza_dialect
from sqlalchemy import create_engine

# The connection string uses the data source name configured in the
# ODBC Data Source Administrator (see the note below)
engine = create_engine("netezza://ODBCDataSourceName")

df.to_sql("your_table_name",   # name of the target table (not the database)
          engine,
          if_exists='append',  # append to the table if it already exists
          index=False,         # do not write the DataFrame index as a column
          dtype=your_dtypes,   # optional: explicit column types per column name
          chunksize=1600,      # rows per batch, to stay below the parameter limit
          method='multi')      # send several rows per INSERT statement

A little explanation of the to_sql() parameters:

It is essential that you use the method='multi' parameter if you do not want pandas to take forever to write to the database, because without it pandas sends one INSERT statement per row. You can use 'multi' or you can define your own insertion method. Be aware that you need at least pandas v0.24.0 to use it. See the docs for more info.
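For illustration, this is roughly what a custom insertion method could look like. The callable signature (table, conn, keys, data_iter) is the one documented by pandas; the executemany-based body and the names used here are just a sketch that I have not benchmarked against Netezza:

def insert_with_executemany(table, conn, keys, data_iter):
    # Sketch of a custom to_sql insertion method: build one parameterized
    # INSERT statement and execute it for all rows via DBAPI executemany.
    columns = ", ".join(keys)
    placeholders = ", ".join(["?"] * len(keys))   # pyodbc uses qmark-style placeholders
    sql = f"INSERT INTO {table.name} ({columns}) VALUES ({placeholders})"
    dbapi_conn = conn.connection                  # raw pyodbc connection behind SQLAlchemy
    cursor = dbapi_conn.cursor()
    cursor.executemany(sql, list(data_iter))
    cursor.close()

df.to_sql("your_table_name",
          engine,
          if_exists='append',
          index=False,
          method=insert_with_executemany)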

When using method='multi' it can happen (it happened to me, at least) that you exceed the parameter limit of the driver. In my case it was 1600, so I had to add chunksize=1600 to avoid this.
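If (and this is an assumption on my part) the limit applies to the total number of bound parameters per statement, you can derive a chunksize from the column count instead of hard-coding it; a small sketch with placeholder names:

PARAM_LIMIT = 1600                                         # the limit I ran into; yours may differ
safe_chunksize = max(1, PARAM_LIMIT // len(df.columns))    # rows per chunk * columns <= limit

df.to_sql("your_table_name",
          engine,
          if_exists='append',
          index=False,
          chunksize=safe_chunksize,
          method='multi')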

Note

If you get a warning or error like the following:

C:\Users\USER\anaconda3\envs\myenv\lib\site-packages\sqlalchemy\connectors\pyodbc.py:79: SAWarning: No driver name specified; this is expected by PyODBC when using DSN-less connections
  "No driver name specified; "
pyodbc.InterfaceError: ('IM002', '[IM002] [Microsoft][ODBC Driver Manager] Data source name not found and no default driver specified (0) (SQLDriverConnect)')

Then you probably tried to connect to the database via

engine = create_engine("netezza://usr:pass@address:port/database_name")

You have to set up the database as a data source in the Windows ODBC Data Source Administrator tool and then use the name you defined there:

engine = create_engine("netezza://ODBCDataSourceName")

Then it should have no problem finding the driver.
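As a quick sanity check (just a sketch using standard pyodbc calls), you can list the drivers and data sources that are visible to your Python environment before creating the engine:

import pyodbc

print(pyodbc.drivers())       # installed ODBC drivers; "NetezzaSQL" should appear here
print(pyodbc.dataSources())   # DSNs from the ODBC Data Source Administrator, e.g. "ODBCDataSourceName"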

NemesisMF

I know you already answered the question yourself (thanks for sharing the solution).

One general comment about large data writes to Netezza: I'd always choose to write the data to a file and then use the external table/ODBC interface to insert it. Instead of inserting 1600 rows at a time, you can probably insert millions of rows in the same timeframe.

We use UTF-8 data in the flat file, and CSV, unless you want to load binary data, which will probably require fixed-width files.

I’m not python-savvy, but I hope you can follow me ...

If you need a documentation link, you can start here: https://www.ibm.com/support/knowledgecenter/en/SSULQD_7.2.1/com.ibm.nz.load.doc/c_load_create_external_tbl_syntax.html
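As a rough sketch of what that looks like from the client side (the connection string, file path, table name and USING options below are placeholders to adapt to your setup, so treat this as an illustration rather than tested code):

import pyodbc

# Load a client-side CSV file into Netezza through a transient external table.
conn = pyodbc.connect("DSN=ODBCDataSourceName;UID=usr;PWD=pass")
cursor = conn.cursor()
cursor.execute(r"""
    INSERT INTO your_table_name
    SELECT * FROM EXTERNAL 'C:\temp\data.csv'
    USING (
        DELIMITER ','          -- column separator
        QUOTEDVALUE 'DOUBLE'   -- string values wrapped in double quotes
        SKIPROWS 1             -- skip the CSV header row
        REMOTESOURCE 'ODBC'    -- the file lives on the client, not on the Netezza host
    )
""")
conn.commit()
conn.close()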

Lars G Olsen
  • Thanks for the information. You are right. The 1600 parameter limit was a surprise for me and is kind of a drawback. I am relatively new to databases and maybe my thinking is too naive. But let's say I have data with millions of rows in my code and in my system memory. My naive thinking was: I want to write that directly to Netezza instead of writing a csv file first, which needs conversion to ASCII and time for writing the file to the HDD, just to open it again and upload it to Netezza. And you are right: because of the big amount of data I would prefer saving binary instead of CSV. – NemesisMF Sep 20 '19 at 09:35
  • You will unfortunately need to convert it to a format the database supports; there is no ‘binary’ insert option. I was referring to columns of data containing PDF files or the like. – Lars G Olsen Sep 21 '19 at 12:50
  • UTF8-CSV is probably the best. Study the ‘escaping’ capabilities of netezza external tables so you can handle your four needed ‘special’ characters: row separator, column separator, escape character, string quotation mark. I usually go with (in the same order) chr(13), chr(11), chr(92), chr(34) - or as they go by in ‘civil life’: carriage return, tab, backslash, double quote – Lars G Olsen Sep 21 '19 at 12:58
  • I think you are right; because of the chunksize, loading a file is probably faster. I am still trying out a few things to "avoid" the detour over a CSV file. Maybe instead of `method='multi'` I define my own insertion method. Sadly there doesn't seem to be a `BULK INSERT` in netezza. Maybe `COPY FROM` will help (at least it has an option for binary files). I will dig into the IBM reference. Another possibility I see: to solve the problem via SQLAlchemy with `executemany()`, which I just read about in [this question](https://stackoverflow.com/q/48006551/6351763). But I expect problems with the custom dialect. – NemesisMF Sep 21 '19 at 19:24
  • I understand your hesitation, file I/O is much slower than CPU/RAM speed... but you have a TCP/IP network in the mix and an ODBC driver with sloooow code. Don’t expect more than 40MB/sec under those conditions. Furthermore, the processing of external tables on netezza is pretty good. Look at file pipes if you are on Linux/Unix at the client (pandas) end. That way all I/O is kept in memory and you STILL get to bypass the odbc type-conversion code that’s the real bottleneck when you try to avoid files. In other words: trust me: files are faster - intuitive or not :) – Lars G Olsen Sep 22 '19 at 12:20
  • Like @LarsGOlsen was describing, write the data into a file and use a `transient external table` to load the data. That means you'll create an INSERT statement on top of an external file at a given location to load the data. The documentation should have more info about how to do this. – demircioglu Oct 05 '19 at 00:14