Let's say generically my setup is like this:

import io

import pyarrow as pa
import pyarrow.ipc as ipc
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()


@app.get("/api/getdata")
async def getdata():
    table = pa.Table.from_pydict({
        "name": ["Alice", "Bob", "Charlie"],
        "age": [25, 30, 22],
    })

    # Not really sure what goes here; something like this...
    sink = io.BytesIO()
    with ipc.new_file(sink, table.schema) as writer:
        for batch in table.to_batches():
            writer.write(batch)
    sink.seek(0)  # rewind so the response reads from the start
    return StreamingResponse(content=sink, media_type="application/vnd.apache.arrow.file")

This works, but am I copying the whole table into BytesIO first? It seems like what I need is a generator that yields whatever writer.write(batch) would write to the buffer, instead of actually writing it, but I don't know how to do that. I tried using pa.BufferOutputStream instead of BytesIO, but I can't pass that to FastAPI as the response content.

My goal is to be able to get the data on the js side like this...

import { tableFromIPC } from "apache-arrow";
const table = await tableFromIPC(fetch("/api/getdata"));
console.table([...table]);

My approach works; I'd just like to know if there's a way to do this without the copying.

Dean MacGregor
  • You might find [this](https://stackoverflow.com/a/75760884/17865804), as well as [this](https://stackoverflow.com/a/75837557/17865804) and [this](https://stackoverflow.com/a/73672334/17865804) helpful – Chris Jul 27 '23 at 12:26

2 Answers

With a pa.BufferOutputStream as the sink, you can get the finalised buffer using sink.getvalue(), but you then need to convert it to a Python bytes object, which is a copy. I've not (yet) found a way to hand the buffer to the FastAPI Response without copying it:

pybytes = sink.getvalue().to_pybytes()
return Response(content=pybytes,
                media_type="application/vnd.apache.arrow.stream")
Tomb

Yes, you can stream the data without first copying the entire table into BytesIO. Instead of filling a single buffer and returning it, use a generator that yields the serialized data batch by batch. To do this, you can use pa.BufferOutputStream, as you mentioned. The generator can be a regular or an async one, since FastAPI's StreamingResponse accepts either.


With this approach, the Arrow data is streamed batch by batch, so the whole serialized table is never held in a single buffer before it is sent. On the JavaScript side, you can use the tableFromIPC function as you demonstrated, and it will handle the streamed data transparently.

karel
Ramez