
I have a big table residing in an MPP database (Netezza). The table has 46 million rows and 150 columns. It uses 9 GB according to the metadata.

In an SQL client I can retrieve the sum of a column in under a minute: select sum(BYTES) from mybigtable

In the proprietary BI tool Tableau I can establish a live connection to the database and get the sum of this column in under a minute. This is because the Tableau engine pushes the summation logic down to the database, so the data doesn't need to be transferred to the machine running Tableau.

So I'm looking for the same kind of functionality when doing exploratory data analysis in Python/Jupyter, preferably with a Pandas-like interface.

For IBM Db2 there is a package that provides what I'm looking for: ibmdbpy

The ibmdbpy project provides a Python interface for data manipulation and access to in-database algorithms in IBM Db2. It accelerates Python analytics by seamlessly pushing operations written in Python into the underlying database for execution, thereby benefitting from in-database performance-enhancing features, such as columnar storage and parallel processing [...] it copies the well-known interface of the Pandas library for data manipulation [...]
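To illustrate, here is a minimal sketch of that pandas-like interface based on the project's documentation (the DSN, credentials and table name are placeholders, and the exact aggregation calls supported may vary by version):

from ibmdbpy import IdaDataBase, IdaDataFrame

# Connect through an ODBC DSN (placeholder name and credentials)
idadb = IdaDataBase(dsn='MYDB2', uid='john', pwd='password')

# An IdaDataFrame is a lazy reference to a Db2 table, not an in-memory copy
idadf = IdaDataFrame(idadb, 'MYBIGTABLE')

# The aggregation is translated to SQL and executed inside Db2
idadf['BYTES'].sum()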

But this package is limited to Db2, and is thus not applicable to other MPP / column-store databases.

Approaches tried in Python

When using Pandas:

import pandas as pd

df = pd.read_sql('select * from mybigtable', conn)  # conn is an open DB connection
df['BYTES'].sum()

The entire dataset is first read into memory. This takes a long time, and memory runs out anyway, so the kernel dies.

I tried using Dask:

import urllib.parse
import dask.dataframe as dd

params = urllib.parse.quote_plus("DRIVER=/usr/local/nz/lib64/libnzodbc.so;SERVER=netezza;PORT=5480;DATABASE=mydb;UID=john;PWD=password")
df = dd.read_sql_table('mybigtable', "netezza+pyodbc:///?odbc_connect=%s" % params, index_col='ID', engine_kwargs={'echo': True})
df['BYTES'].sum().compute()

There are several problems with this approach:

1. The data still needs to be read into memory.
2. It is read partition by partition, which takes a long time, as there are many partitions.
3. To my surprise, the SQLAlchemy log shows that all columns are read in, not only the column I'm summing over.
4. Memory still gets clogged up; it seems that memory is not freed after each partition is read.

Note

I understand that I could pick out one or a few columns to work with when creating the dataframe, as this would significantly reduce the amount of data. However, this doesn't give me the flexibility and ease I'm getting in Tableau, as I'm exploring the whole dataset with no prior hypothesis about what columns are interesting.

matthiash
  • The package "blaze" was developed to do exactly what you want, but has not been updated/supported for a long time. It might still work. – mdurant Jul 22 '20 at 13:12
  • Tableau is not proprietary - it's a leading package to provide analytics to enterprises :) Replacing it is not a small undertaking in terms of engineering. You will need to build cubes that provide the type of analytics in a middle tier. I know organisations that have used Kafka as a middle tier – Rob Raymond Jul 24 '20 at 03:18

1 Answer


You might want to take a look at Ibis.
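For example, here is a rough sketch of how an Ibis expression pushes the aggregation into the database (the connection details are placeholders, and whether a Netezza backend is available is an assumption; the PostgreSQL backend is shown for illustration):

import ibis

# Connect to a supported backend (PostgreSQL shown; host and credentials are placeholders)
con = ibis.postgres.connect(host='myserver', database='mydb', user='john', password='password')

t = con.table('mybigtable')   # a lazy table expression; no rows are transferred
expr = t['BYTES'].sum()       # builds an expression, nothing is executed yet
expr.execute()                # compiled to SQL and run inside the database

The expression API mirrors pandas-style column selection and aggregation, while execution happens in the database, which is the same kind of pushdown a Tableau live connection provides.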

MRocklin