I'm working on a FastAPI endpoint that, when a user makes a request, should do the following:
- First, a GET request will grab a file from Google Cloud Storage and load it into a PySpark DataFrame
- Then the application will perform some transformations on the DataFrame
- Finally, I want to write the DataFrame to the user's disk as a parquet file.
I can't quite figure out how to deliver the file to the user in parquet format, for a few reasons:
- df.write.parquet('out/path.parquet') writes the data into a directory at out/path.parquet, which presents a challenge when I try to pass it to starlette.responses.FileResponse
- Passing a single .parquet file that I know exists to starlette.responses.FileResponse just seems to print the binary to my console (as demonstrated in my code below)
- Writing the DataFrame to a BytesIO stream like in pandas seemed promising, but I can't quite figure out how to do that using any of DataFrame's methods or DataFrame.rdd's methods.
Is this even possible in FastAPI? Is it possible in Flask using send_file()?
Here's the code I have so far. Note that I've tried a few things like the commented code to no avail.
import tempfile
from fastapi import APIRouter
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from starlette.responses import FileResponse
router = APIRouter()
sc = SparkContext('local')
spark = SparkSession(sc)
df = spark.read.parquet('gs://my-bucket/sample-data/my.parquet')
@router.get("/applications")
def applications():
    df.write.parquet("temp.parquet", compression="snappy")
    return FileResponse("part-some-compressed-file.snappy.parquet")
    # with tempfile.TemporaryFile() as f:
    #     f.write(df.rdd.saveAsPickleFile("temp.parquet"))
    #     return FileResponse("test.parquet")
Thanks!
Edit: I tried using the answers and info provided here, but I can't quite get it working.