
I have been exploring Polars for my web application. It's been impressive so far, until I hit this issue that has stalled my use of this awesome library. Use case: I read a parquet file into a Polars dataframe and use this dataframe to serve results for a GET request on FastAPI.

@fastApi.get("/polars-test")
async def polars_test():
    polars_df = pl.read_parquet("/data/all_area_keys.parquet")
    df = polars_df.limit(3)
    return df.to_dicts()


polars = 0.16.2
pyarrow = 9.0.0
fastapi = 0.92.0
Base Docker image: tiangolo/uvicorn-gunicorn-fastapi:python3.11

When I package it into a Docker image and run the FastAPI app under gunicorn, this GET path does not respond. Using /docs, hitting this endpoint just waits for several minutes until the worker terminates, without any errors logged.

I am starting to think Polars' multithreading is not playing well with FastAPI's concurrency, but I am unable to find related documentation to get an understanding. Please help; I would hate to abandon Polars.

Troubleshooting done so far:

  1. The GET request works perfectly when I test it locally.
  2. Logged on to the running Docker container and ran the above Polars commands - it works.
  3. Tried printing just the schema of the dataframe - it works, so the dataframe is created and its metadata is available. I get this issue only when I run filter or any transform on the Polars dataframe.
  4. Created a lazy frame and tried to collect, but no luck.
  5. Removed async from the method, no luck.
  6. Changed the Python version from 3.8 to 3.11, no luck.
  7. Specified the platform as linux/amd64 when running the Docker container, no luck.
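The fork-related failure mode suspected here can be demonstrated with the standard library alone (a minimal sketch, not Polars-specific; all names are illustrative): when a process forks while another thread holds a lock, the child inherits the lock in its locked state, but the thread that would release it does not exist in the child, so any blocking acquire hangs forever. This is the same hazard a forked gunicorn worker faces if a thread pool was active in the parent at fork time.

```python
import os
import threading
import time


def child_sees_locked():
    """Fork while another thread holds a lock; report whether the
    forked child can acquire its inherited copy of the lock.
    It cannot: the holding thread does not exist in the child,
    so a blocking acquire there would deadlock."""
    lock = threading.Lock()
    release = threading.Event()

    def hold():
        with lock:
            release.wait()  # keep the lock held until told to stop

    t = threading.Thread(target=hold)
    t.start()
    while not lock.locked():  # wait until the thread actually holds it
        time.sleep(0.01)

    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:  # child process
        os.close(r)
        got = lock.acquire(blocking=False)  # non-blocking, to avoid hanging the demo
        os.write(w, b"1" if got else b"0")
        os._exit(0)

    os.close(w)
    result = os.read(r, 1)
    os.waitpid(pid, 0)
    release.set()
    t.join()
    return result == b"1"


if __name__ == "__main__":
    print("child could acquire inherited lock:", child_sees_locked())  # False
```

The non-blocking acquire returns False in the child, which is exactly the state a forked worker would block on forever with a blocking acquire.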
  • If you change the return to something generic like "hello world" instead of returning dicts, does that work? It seems you haven't yet ruled out causes that have nothing to do with polars. Separately, I don't know about fastapi, but in Flask, I think you'd `jsonify` the dicts before you actually return them. – Dean MacGregor Feb 21 '23 at 11:29
  • Thank you. But that did not help. I have returned DataFrame.schema on that path, which works. It's only when I apply filter or limit or any transform that the path becomes non-responsive. – Data Analyst Feb 21 '23 at 17:25
  • Put some logging above each line: just before the read_parquet put "received request", then f"loaded pq file with shape={polars_df.shape}", and so on... Also, try a different, smaller file so you can try without `limit`. – Dean MacGregor Feb 21 '23 at 19:49

2 Answers


Alright, I found the details that explain the issue I was facing: https://pola-rs.github.io/polars-book/user-guide/howcani/multiprocessing.html

So the change I had to make was to move the file read into a function, to avoid passing the file lock on to the newly forked process/thread.

Working code:

@geodataRouter.get("/polars-test")
async def polars_test():
    ALL_AREA_KEYS_PL = get_all_area_keys_pl()
    df = ALL_AREA_KEYS_PL.limit(3)
    return df.to_dicts()

def get_all_area_keys_pl():
    ALL_AREA_KEYS_PL = pl.read_parquet("/data/all_area_keys.parquet")
    return ALL_AREA_KEYS_PL

In retrospect, I should not have had blocking I/O operations in a FastAPI async def to begin with.
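One way to keep a blocking read out of the event loop is to hand it to a worker thread; a minimal stdlib sketch using `asyncio.to_thread`, where `load_frame` is a hypothetical stand-in for the `pl.read_parquet` call (FastAPI would await the endpoint the same way):

```python
import asyncio


def load_frame(path: str):
    # Stand-in for the blocking pl.read_parquet(path) call.
    return {"path": path, "rows": 3}


async def polars_test():
    # asyncio.to_thread runs the blocking call in a worker thread,
    # leaving the event loop free to serve other requests meanwhile.
    df = await asyncio.to_thread(load_frame, "/data/all_area_keys.parquet")
    return df


if __name__ == "__main__":
    print(asyncio.run(polars_test()))  # {'path': '/data/all_area_keys.parquet', 'rows': 3}
```

FastAPI also offers `fastapi.concurrency.run_in_threadpool` for the same purpose, and simply declaring the endpoint with plain `def` makes FastAPI run it in its threadpool automatically.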

  • Please have a look at [this answer](https://stackoverflow.com/a/71517830/17865804), which explains the difference between normal `def` and `async def` endpoints in FastAPI, and provides solutions when one needs to run blocking I/O-bound or CPU-bound operations inside `async def` endpoints. – Chris Feb 22 '23 at 05:03
  • Also, how exactly does your answer solve the issue? Moving the `.read_parquet()` method to another function that is again called within the `async def` function does not seem like a solution, or anything different from the code snippet in the question. – Chris Feb 22 '23 at 05:06
  • Also, please have a look at [this answer](https://stackoverflow.com/a/73580096/17865804) and [this answer](https://stackoverflow.com/a/73694164/17865804), which provide details and solutions on efficiently returning dataframe results to the client. Additionally, you seem to be loading the same dataframe/parquet over and over again (every time the endpoint is called). If that's the case, I would recommend loading the dataframe *once* at startup and store it on the app instance (see [here](https://stackoverflow.com/a/71613757/17865804) and [here](https://stackoverflow.com/a/71298949/17865804)). – Chris Feb 22 '23 at 05:41

Having recently switched from pandas to Polars for a Dash Plotly app (Flask-based), I can definitely confirm the two don't play well together. The speed increase on the DataFrame operations themselves is impressive (all cores engaged; I'm seeing 10-12x increases in some cases), but once I started replacing pandas code with Polars in a computations.py that has a few functions returning DataFrames, basic operations like reading a file or joining DataFrames became extremely hard to debug or even execute.

I've spent hours today trying to figure out why a bunch of files weren't being read. I eventually set the parallel argument to 'none' in .read_parquet, and that particular block now works. It now stops at a .join. I have dozens of operations that I managed to parallelize with Polars, but it seems a nightmare to add this to a web app. Doing a test run with the production-recommended settings for Dash (gunicorn, Redis for cache, and Celery) with different settings for workers and such, I got a lot of "leaked semaphore objects to clean up" errors, and a perfectly working app now has a lot of choke points.

Not sure where to go from here; I'll probably build a FastAPI endpoint that returns JSON and read those responses into pandas in Dash for plotting.

Anybody else trying to integrate polars into Dash?