
I am running a very simple query from Python against a table in a Snowflake database, using snowflake-connector-python==2.3.3 installed with the additional [pandas] extra. I containerized my Python app with the python:3.7.0-slim image, and my script is extremely simple:

import os

from snowflake import connector

# Connect using credentials and context from the environment.
ctx = connector.connect(
    user=os.environ['USER'],
    password=os.environ['PASSWORD'],
    account=os.environ['ACCOUNT'],
    warehouse=os.environ['WAREHOUSE'],
    database=os.environ['DATABASE'],
    schema=os.environ['SCHEMA'])

cur = ctx.cursor()

# Execute a statement that will generate a result set.
sql = "SELECT * FROM MY_TABLE ORDER BY MY_COLUMN"
print("executing query: " + sql)
cur.execute(sql)

# Pull the entire result set into a single pandas DataFrame.
df = cur.fetch_pandas_all()

The actual table size, according to Snowflake, is 3.3 GB. However, when I run this app it crashes after consuming over 9 GB of RAM. I know this because I'm running it in a Kubernetes cluster and the pod is evicted with a report that it used 9535336Ki of memory. Is there something I'm missing here? How can the memory usage be 3x the table size?
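
For context, here is a sketch of a batched variant I'm considering, using the connector's fetch_pandas_batches() so only one chunk of the result set is held in memory at a time (batch sizes are decided server-side, so this is untested as a fix):

# Sketch: stream the result set in chunks instead of one big DataFrame.
# fetch_pandas_batches() yields one pandas DataFrame per internal batch,
# so peak memory is roughly one batch plus whatever we accumulate.
cur.execute(sql)
total_rows = 0
for batch_df in cur.fetch_pandas_batches():
    total_rows += len(batch_df)
    # process/aggregate batch_df here, then let it go out of scope
print("fetched {} rows in batches".format(total_rows))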

alex
  • This may not be simple: `SELECT * FROM`. See [why](https://stackoverflow.com/questions/3639861/why-is-select-considered-harmful). Try selecting only the exact columns the app needs. – Parfait Oct 19 '20 at 19:38
  • One of the top comments reads `It's acceptable to use SELECT * when there's the explicit need for every column in the table(s) involved`, and I need every column. Also, it still doesn't answer the question: the table itself is 3.3 GB, but my container's memory grows beyond 9 GB, so why is that? – alex Oct 19 '20 at 19:42
  • Understood, but for maintainable code it's always a good idea to specify the columns explicitly, in a consistent order, in case the underlying table or code changes. For debugging, try selecting one, two, three, etc. columns at a time and check RAM usage. Watch for large data types (object, arrays, geospatial). I also wonder about `cur.fetch_pandas_all()`. Try removing it for debugging to isolate the problematic line. [Docs](https://docs.snowflake.com/en/user-guide/python-connector-api.html#label-python-connector-api-fetch-pandas-all) indicate it's faster than pandas' `read_sql`, but I wonder. – Parfait Oct 19 '20 at 19:53
  • Cool, thank you for the callout; I should be a bit more specific. When I put the pipeline in production I will switch to naming all columns. I actually found a similar issue in this post: https://stackoverflow.com/questions/41253326/pandas-using-too-much-memory-with-read-sql-table. I'm thinking it might be a pandas-related thing (see the per-column measurement sketch after these comments). – alex Oct 19 '20 at 19:56
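
Following Parfait's debugging suggestion, a rough sketch of how per-column memory could be measured on a small sample before fetching everything (this reuses the `cur` cursor from the question; the LIMIT value is an arbitrary illustrative sample size):

# Sketch: fetch a small sample and see which columns dominate memory.
cur.execute("SELECT * FROM MY_TABLE LIMIT 100000")
sample = cur.fetch_pandas_all()

# deep=True counts the Python objects behind object-dtype columns,
# which is usually where a DataFrame balloons well past the
# compressed size Snowflake reports.
per_column = sample.memory_usage(deep=True)  # bytes per column
print(per_column.sort_values(ascending=False))
print("sample total: {:.1f} MB".format(per_column.sum() / 1e6))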

0 Answers