
I am running a very simple query from Python against a table in a Snowflake database, using snowflake-connector-python==2.3.3 installed with the additional [pandas] extra. I containerized my Python app with the python:3.7.0-slim image, and my script is extremely simple:

import os

from snowflake import connector

# Connect using credentials and context from the environment.
ctx = connector.connect(
    user=os.environ['USER'],
    password=os.environ['PASSWORD'],
    account=os.environ['ACCOUNT'],
    warehouse=os.environ['WAREHOUSE'],
    database=os.environ['DATABASE'],
    schema=os.environ['SCHEMA'])

cur = ctx.cursor()

# Execute a statement that will generate a result set.
sql = "SELECT * FROM MY_TABLE ORDER BY MY_COLUMN"
print("executing query: " + sql)
cur.execute(sql)

# Pull the entire result set into a single pandas DataFrame.
df = cur.fetch_pandas_all()

The actual table size, according to Snowflake, is 3.3 GB. However, when I run this app it crashes after consuming over 9 GB of RAM. I know this because I'm running it in a Kubernetes cluster and the pod is evicted with a report that it used 9535336Ki of memory. Is there something I'm missing here? How can the memory usage be 3x the table size?
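
For context, here is a sketch of a batched variant I'm considering, using the connector's fetch_pandas_batches() so only one chunk of the result set is held in memory at a time (batch sizes are decided server-side, so this is untested as a fix):

# Sketch: stream the result set in chunks instead of one big DataFrame.
# fetch_pandas_batches() yields one pandas DataFrame per internal batch,
# so peak memory is roughly one batch plus whatever we accumulate.
cur.execute(sql)
total_rows = 0
for batch_df in cur.fetch_pandas_batches():
    total_rows += len(batch_df)
    # process/aggregate batch_df here, then let it go out of scope
print("fetched {} rows in batches".format(total_rows))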

alex
  • This may not be simple: `SELECT * FROM`. See [why](https://stackoverflow.com/questions/3639861/why-is-select-considered-harmful). Try selecting only the exact columns the app needs. – Parfait Oct 19 '20 at 19:38
  • One of the top comments reads `It's acceptable to use SELECT * when there's the explicit need for every column in the table(s) involved`, and I need every column. Also, it still doesn't answer the question: the table itself is 3.3 GB, but my container's memory grows beyond 9 GB, so why is that? – alex Oct 19 '20 at 19:42
  • Understood, but for maintainable code it's always a good idea to specify the columns explicitly, in a consistent order, in case the underlying table or code changes. For debugging, try selecting one, two, three, etc. columns at a time and check RAM usage. Watch for large data types (object, arrays, geospatial). I also wonder about `cur.fetch_pandas_all()`. Try removing it for debugging to isolate the problematic line. [Docs](https://docs.snowflake.com/en/user-guide/python-connector-api.html#label-python-connector-api-fetch-pandas-all) indicate it's faster than pandas' `read_sql`, but I wonder. – Parfait Oct 19 '20 at 19:53
  • Cool, thank you for the callout; I should be a bit more specific. When I put the pipeline in production I will switch to naming all columns. I actually found a similar issue in this post: https://stackoverflow.com/questions/41253326/pandas-using-too-much-memory-with-read-sql-table. I'm thinking it might be a pandas-related thing (see the per-column measurement sketch after these comments). – alex Oct 19 '20 at 19:56
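
Following Parfait's debugging suggestion, a rough sketch of how per-column memory could be measured on a small sample before fetching everything (this reuses the `cur` cursor from the question; the LIMIT value is an arbitrary illustrative sample size):

# Sketch: fetch a small sample and see which columns dominate memory.
cur.execute("SELECT * FROM MY_TABLE LIMIT 100000")
sample = cur.fetch_pandas_all()

# deep=True counts the Python objects behind object-dtype columns,
# which is usually where a DataFrame balloons well past the
# compressed size Snowflake reports.
per_column = sample.memory_usage(deep=True)  # bytes per column
print(per_column.sort_values(ascending=False))
print("sample total: {:.1f} MB".format(per_column.sum() / 1e6))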

0 Answers