0

I've written some code to run all the .sql files in a given directory and write the output of each to csv. However, I'm running into an issue with a very large query that eats up all my RAM and then crashes the process. Unfortunately, the query output is going to be that large, so the only solution I can see is to write the output directly to file.

The script in question:

import pandas as pd
from sqlalchemy.engine import URL
from sqlalchemy import create_engine

serverName = 'FOO'
databaseName = 'BAR'

connection_string = ("Driver={SQL Server};"
            "Server=" + serverName + ";"
            "Database=" + databaseName + ";"
            "Trusted_Connection=yes;")

connection_url = URL.create(
    "mssql+pyodbc", query={"odbc_connect":connection_string })
engine = create_engine(connection_url)

sqlFile = "select 'Hello world' as 'Header'"

# Run the SQL (sqlFile) on the connection (engine)
pd.read_sql(sqlFile, engine).to_csv('foo.csv', header=True, index=False, sep='|')

I had previously read the sql into a variable (data frame) and then written that out to csv.

data = pd.read_sql(sqlFile, engine)
data.to_csv(file_name)

I was hoping that by changing to the following, the output would write directly to file, as I'd removed the "store output as variable data" step.

pd.read_sql(sqlFile, engine).to_csv(file_name)

How can I have the output of a sql query written directly to a file, row by row, without storing the entire output in memory prior to writing the file?

(Edited to minimally runnable code)

Sky
  • 1
  • 2
  • 1
    Welcome to Stack Overflow. To be clear: the `.sql` file contains an SQL **query**, and you currently are relying on Pandas to route this to `read_sql_query`, giving the query to the `engine` and get the results? It **does not** contain some raw SQL table data to convert to CSV? – Karl Knechtel Jan 05 '23 at 00:41
  • Also, where does e.g. `create_engine` come from? Please read [mre]; if you want to fix an existing piece of code, try to make sure that others could **copy and paste** the code **without changing or adding anything** and **directly** reproduce the **exact** problem. However, in this case it seems like the code is not really useful for answering the question. `Pandas.read_sql` is doing to put everything in memory up front no matter what; this task **can** be done but it will require a **completely different approach**. – Karl Knechtel Jan 05 '23 at 00:44
  • Create_Engine comes from: from sqlalchemy import create_engine The .sql file contains a sql query. I'll rewrite so it's minimally reproducible, but it'll take me a few. I just dropped in the code to give an idea of what I was doing. To be clear though... the code can't be reproducible unless you have a sql server and change a couple settings in the code, so I wasn't sure what to do there. To that end, I'll take any approach as I can't find a solution anywhere. – Sky Jan 05 '23 at 00:50
  • I updated the code to be minimal and reproducible. – Sky Jan 05 '23 at 00:59

1 Answers1

1

Pandas will not be useful here. Instead, run the query directly, which gives you a cursor for the database. The exact mechanism for this will depend on your SQL engine, but for example the built-in standard library sqlite3 offers a guided tutorial.

Using SQLAlchemy, we can ask for the necessary underlying connection and cursor objects, use the cursor to run the query, iterate over the rows, and write each using a csv.writer:

connection = engine.raw_connection()
cursor = connection.cursor()
query = "select 'Hello world' as 'Header'"

with open('foo.csv', 'w') as f:
    w = csv.writer(f, delimiter='|', )
    w.writerow(('Header',))
    for row in cur.execute(query):
        w.writerow(row)

# followed by whatever usual cleanup logic for SQLAlchemy

I am not familiar with SQLAlchemy specifically, but it likely offers a more direct way to iterate over the query. The key idea here is to set up this kind of loop over the results so that it can be fed to a csv.writer.

Karl Knechtel
  • 62,466
  • 11
  • 102
  • 153
  • This seems to be working perfectly. I used the post here (https://stackoverflow.com/questions/49797786/how-to-get-cursor-in-sqlalchemy) for the sqlalchemy specific method to using a cursor. The real test will be when I re-write my original code and run the hours-long query. – Sky Jan 05 '23 at 01:23