
I would like to know how to stream a DataFrame using FastAPI, without having to save the DataFrame to a .csv file on disk. Currently, what I have managed to do is stream data from the .csv file, but it was not as fast as returning a FileResponse. The /option7 below is what I'm trying to do.

My goal is to stream data from FastAPI backend without saving the DataFrame to a csv file.

Thank you.

import pandas as pd
from fastapi import FastAPI, Response, Query
from fastapi.responses import FileResponse, HTMLResponse, StreamingResponse

app = FastAPI()

df = pd.read_csv("data.csv")

@app.get("/option4")
def load_questions():
    return FileResponse(path="C:/Downloads/data.csv", filename="data.csv")

@app.get("/option5")
def load_questions():
    def iterfile():
        with open('data.csv', mode="rb") as file_like:
            yield from file_like

    return StreamingResponse(iterfile(), media_type="text/csv")

@app.get("/option7")
def load_questions():
    def iterfile():
        # with open(df, mode="rb") as file_like:  # a DataFrame can't be opened like a file
        yield from df  # this only yields the column names, not the data

    return StreamingResponse(iterfile(), media_type="application/json")


  • There is no dataframe in this question that I could see? – AKX Sep 12 '22 at 11:37
  • Yeah, because the code that I built for the DataFrame didn't work – new_dev Sep 12 '22 at 11:38
  • Does this answer your question? [How to return a .csv file/Pandas DataFrame in JSON format using FastAPI?](https://stackoverflow.com/questions/71203579/how-to-return-a-csv-file-pandas-dataframe-in-json-format-using-fastapi) – Chris Sep 12 '22 at 11:40
  • Thanks for your reply. I tried these functions, but the issue is that I have huge data, and if I serve it through these options, my RAM will be full – new_dev Sep 12 '22 at 11:41
  • Why even use Pandas to load a CSV into a DF if you're just going to emit the same CSV anyway..? – AKX Sep 12 '22 at 11:41
  • Currently, what I'm doing is saving the DataFrame as CSV after doing the full preprocessing, in order to make it accessible in the API. I need a way to eliminate the need for saving the DataFrame as CSV – new_dev Sep 12 '22 at 11:43
  • @new_dev a dataframe isn't a file format. It's an in-memory data structure. If you want to send its data to a caller you'll have to serialize it to some file format that the client can understand. It could be any format: JSON, CSV, parquet, whatever you want. – Panagiotis Kanavos Sep 13 '22 at 09:11
  • @new_dev where does the data come from in the first place? Why not send that data directly instead of loading it into a dataframe first? If your data is in a CSV file, just send the file – Panagiotis Kanavos Sep 13 '22 at 09:12
  • @Chris not exactly. `to_csv()` without a path will do the same. In either case, the entire string will be allocated in memory before it's returned. That's not scalable. – Panagiotis Kanavos Sep 13 '22 at 10:49

1 Answer


Approach 1 (recommended)

As mentioned in this answer, as well as here and here, when the entire data (a DataFrame in your case) is already loaded into memory, there is no need to use StreamingResponse. A StreamingResponse makes sense when you want to transfer real-time data, or when you don't know the size of your output ahead of time and don't want to wait to collect it all before you start sending it to the client. It also makes sense when the file you would like to return is too large to fit into memory (for instance, if you have 8GB of RAM, you can't load a 50GB file) and you would hence rather load it into memory in chunks.

In your case, as the DataFrame is already loaded into memory, you should instead return a custom Response directly, after using the .to_json() method to convert the DataFrame into a JSON string, as described in this answer (see related posts here and here as well). Example:

from fastapi import Response

@app.get("/")
def main():
    return Response(df.to_json(orient="records"), media_type="application/json")

If you find the browser taking a while to display the data, you may want to have the data downloaded as a .json file to the user's device (which would be completed much faster), rather than waiting for the browser to display a large amount of data. You can do that by setting the Content-Disposition header in the Response using the attachment parameter (see this answer for more details):

@app.get("/")
def main():
    headers = {'Content-Disposition': 'attachment; filename="data.json"'}
    return Response(df.to_json(orient="records"), headers=headers, media_type='application/json')

You could also return the data as a .csv file, using the .to_csv() method without specifying the path parameter. Since using return df.to_csv() would result in displaying the data in the browser with \r\n characters included, you might find it better to put the csv data in a Response instead, and specify the Content-Disposition header, so that the data will be downloaded as a .csv file. Example:

@app.get("/")
def main():
    headers = {'Content-Disposition': 'attachment; filename="data.csv"'}
    return Response(df.to_csv(), headers=headers, media_type="text/csv")

Approach 2

To use a StreamingResponse, you would need to iterate over the rows in a DataFrame, convert each row into a dictionary and subsequently into a JSON string, using either the standard json library, or other faster JSON encoders, as described in this answer (the JSON string will be later encoded into byte format internally by FastAPI/Starlette, as shown in the source code here). Example:

@app.get("/")
def main():
    def iter_df():
        for _, row in df.iterrows():
            yield json.dumps(row.to_dict()) + '\n'

    return StreamingResponse(iter_df(), media_type="application/json")
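Since each line of this stream is a standalone JSON object, a client can consume the response incrementally. A minimal sketch using httpx (an assumption; any HTTP client with streaming support would work):

import json
import httpx  # assumption: pip install httpx

# Read the line-delimited JSON stream one row at a time,
# without buffering the entire response body in memory.
with httpx.stream("GET", "http://localhost:8000/") as response:
    for line in response.iter_lines():
        if line:  # skip any empty lines
            row = json.loads(line)
            print(row)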

Iterating through Pandas objects is generally slow and not recommended. As described in this answer:

Iteration in Pandas is an anti-pattern and is something you should only do when you have exhausted every other option. You should not use any function with "iter" in its name for more than a few thousand rows or you will have to get used to a lot of waiting.

Update

As @Panagiotis Kanavos noted in the comments section below, using either .to_json() or .to_csv() on a DataFrame that is already loaded into memory would result in allocating the entire output string in memory, thus doubling the RAM usage, or even worse. Hence, if you have such a huge amount of data that either method above may cause your system to slow down or crash (because of running out of memory), you should rather use StreamingResponse, as described earlier. You may find faster alternatives to iterrows() in this post, as well as faster JSON encoders, such as orjson and ujson, as described in this answer and this answer.
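As a rough sketch of one such faster alternative (assuming orjson is installed, and with chunk_size being a tuning knob you would pick for your data), you could stream the DataFrame in slices, serializing each slice in a single call instead of iterating row by row:

import orjson  # assumption: pip install orjson

@app.get("/")
def main():
    def iter_df_chunks(chunk_size: int = 1000):
        # Slice the DataFrame instead of using iterrows(), and serialize each
        # slice in one call; only one chunk is held in memory at a time.
        # Note: each yielded line is a JSON array of records, i.e., the body
        # is newline-delimited JSON rather than a single JSON document.
        for start in range(0, len(df), chunk_size):
            chunk = df.iloc[start:start + chunk_size]
            yield orjson.dumps(chunk.to_dict(orient="records")) + b'\n'

    return StreamingResponse(iter_df_chunks(), media_type="application/json")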

Alternatively, you could save the data to disk, then delete the DataFrame to release the memory (you could even manually trigger the garbage collection using gc.collect(), as shown in this answer; however, frequent calls to garbage collection are discouraged, as it is a costly operation that may affect performance) and return a FileResponse (assuming the data can fit into RAM; otherwise, you should use StreamingResponse, see this answer, as well as this answer), and finally, have a BackgroundTask delete the file from disk after returning the response. An example is given below:

from fastapi import BackgroundTasks
from fastapi.responses import FileResponse
import uuid
import os

@app.get("/")
def main(background_tasks: BackgroundTasks):
    global df  # needed so that `del df` below releases the module-level DataFrame
    filename = str(uuid.uuid4()) + ".csv"
    df.to_csv(filename)
    del df  # release the memory
    background_tasks.add_task(os.remove, filename)  # delete the file after the response is sent
    return FileResponse(filename, filename="data.csv", media_type="text/csv")
    # or return StreamingResponse, if the file can't fit into RAM; see linked answers above

The solution you choose should be based on your application's requirements (e.g., the number of users you expect to serve simultaneously, the size of the data, the response time, etc.), as well as your system's specifications (e.g., available memory for allocation). Additionally, since all calls to the DataFrame's methods are synchronous, you should remember to define your endpoint with a normal def, so that it is run in an external threadpool; otherwise, it would block the server. Alternatively, you could use Starlette's run_in_threadpool() from the concurrency module, which will run the to_csv() or to_json() function in a separate thread, ensuring that the main thread (where coroutines are run) does not get blocked. Please have a look at this answer for more details on def vs async def, as well as solutions for running synchronous blocking operations inside async def endpoints.
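For instance, a minimal sketch of the run_in_threadpool() approach inside an async def endpoint (reusing the same global df as above) might look like this:

from starlette.concurrency import run_in_threadpool
from fastapi import Response

@app.get("/")
async def main():
    # Run the blocking to_csv() call in a worker thread, so that the
    # event loop serving other requests is not blocked while serializing.
    csv_data = await run_in_threadpool(df.to_csv)
    headers = {'Content-Disposition': 'attachment; filename="data.csv"'}
    return Response(csv_data, headers=headers, media_type="text/csv")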

  • `to_csv()` without a path will do the same. Whether `to_json()` or `to_csv()` is used though, the entire output string will be allocated in memory, doubling RAM usage or worse. There's a *very* good reason to search for streaming instead of allocating the entire string in memory. – Panagiotis Kanavos Sep 13 '22 at 10:49