3

I have a Flask app and want to use duckdb as a database for several endpoints. My idea is to query the data and return it as a .parquet file. When I test my database with a simple Python script outside of the Flask app, it can query the data and save it as a .parquet in under a second. When I bring that same methodology to the Flask app, it still successfully queries the data and returns it as a .parquet file but it takes roughly 45 seconds. Other endpoints that return a .parquet file -- ones that are pre-staged and do not need to be queried -- can do so in just a second or two. So the issue, apparently, is incorporating duckdb inside my Flask application. Here is a sample boiler plate of what I have:

@test.route('/duckdb', methods = ['GET'])
def duckdb_test():

    con = duckdb.connect(database = '~/flask_db/test.db')

    # get tempfile .parquet
    tmp = tempfile.NamedTemporaryFile(suffix = '.parquet', mode = 'w+b', delete = False)

    # get data
    df = con.sql("SELECT * FROM tbl WHERE name = 'John'").to_df()

    # write to temporary .parquet
    df.to_parquet(tmp.name, engine='pyarrow', index=False)

    return send_file(tmp.name, mimetype='application/octet-stream', as_attachment=True, download_name="request.parquet")

I want to save it as a temp file. Not really sure what's wrong here. Again, it does work, but it just takes way, way too much time. The data being returned is about 12,000 rows in a ~5.5M row database -- but given that it works fairly quickly outside of the Flask app, on the same VM, the size itself shouldn't be an issue.

codeweird
  • 145
  • 3
  • 11
  • Have you tried using DuckDB to write the parquet file instead of pandas+pyarrow? And does the exact same code run faster outside of Flask, or were you only testing something similar? – Mause Jun 20 '23 at 14:24
  • @Mause, yeah so using `duckdb` to write the parquet file helps a little. The request now returns in 15 seconds, a big improvement, but still much, much slower than what is anticipated. I'm running on a dev. server and not a gunicorn, etc. one, but I don't expect that to really be affecting much, given that the speed of other endpoints is still very, very quick. The other code that is running outside of Flask is basically the same thing (same query and using `duckdb` to write the parquet file). Assuming the hang-up is something with Flask. – codeweird Jun 20 '23 at 15:08
  • Just to troubleshoot, try updating the db path from relative to absolute. So from ```'~/flask_db/test.db'``` to ```'/path/to/flask_db/test.db'``` or if Windows based, to ```'C:/path/to/flask_db/test.db'```. See if that plays any factor. – djmonki Jun 23 '23 at 00:25
  • 1
    Check this answer, see if it helps with your issue: https://stackoverflow.com/a/19411051/11543023 – djmonki Jun 23 '23 at 10:07
  • 1
    @djmonki yeah, so shifting to a gunicorn server is much, much faster and the speeds are now what I expect. I'm going to need to play around with my SQL back-end though, because gunicorn isn't connecting to my local MySQL server, whereas running the app in a development server *does* connect to it. Might have to shift to the 127.0.0.1 format instead of localhost – codeweird Jun 23 '23 at 15:34
  • Another thing to do is to switch from `flask` to `fastapi`. flask is single threaded whereas fastapi can serve asynchronously. – Dean MacGregor Jul 26 '23 at 11:49
  • @DeanMacGregor, will do. I've written this out pretty extensively in Flask already, though, so I am hesitant about switching stacks. This will be in (minor) production, so speed certainly matters. But rewriting it would be annoying. – codeweird Aug 03 '23 at 18:14

1 Answers1

1

My Suggestion is not to connect to the DB every time the API is being hitted. Instead create a engine and reuse the same connection. a thing called "Engine" can be helpful to you, so that you don't have to worry about connecting to the db everytime you hit the DB.

check this pypi package here at pypi duck db engine package:

steps to install and use in flask:

Installation:

 pip install duckdb-engine

Engine Creation: plain engine method

Create the engine in some area like where you create and parse the config in flask and use the engine wherever you need.

 eng = create_engine("duckdb:///:memory:")

Engine Creation: Session creation method

eng = create_engine("duckdb:///:memory:")
Base.metadata.create_all(eng)
session = Session(bind=eng)

Import and Use the engine where ever required:

 eng.execute("register", ("dataframe_name", pd.DataFrame(...)))
 eng.execute("select * from dataframe_name")

Sample Examples to use with pandas:

  df = pd.read_sql('users', engine)
Surya R
  • 445
  • 2
  • 10