
I have Apache Arrow data on the server (Python) and need to use it in the browser. It appears that Arrow Flight isn't implemented in JS. What are the best options for sending the data to the browser and using it there?

I don't even need it necessarily in Arrow format in the browser. This question hasn't received any responses, so I'm adding some additional criteria for what I'm looking for:

  • Self-describing: don't want to maintain separate schema definitions
  • Minimal overhead: For example, an array of float32s should transfer as something compact like a data type indicator, length value and sequence of 4-byte float values
  • Cross-platform: Able to be easily sent from Python and received and used in the browser in a straightforward way

Surely this is a solved problem? If it is I've been unable to find a solution. Please help!

Brian
  • You don't _have_ to use Flight to send Arrow data. You could serialize the Arrow data using the Arrow stream writer, send it over HTTP or WebSockets, and load it from the browser using the Arrow JavaScript libraries. – li.davidm Jan 05 '23 at 13:56
  • @li.davidm Thank you for the tip! Can you please provide an example of doing this in Python? I'm looking at the library and it's not clear to me. On the Python side it looks like the streaming functions expect to stream to files, whereas I'll be trying to send it via a Starlette response which accepts a generator or iterator for streaming responses. On the browser side I think `tableFromIPC()` will work as long as I use a fetch request to read the data. – Brian Jan 05 '23 at 15:17
  • In Python, a 'file' can also just be a BytesIO, which you can presumably then stuff into Starlette (not familiar with the framework myself). Does that work? I can write out a longer answer later – li.davidm Jan 05 '23 at 19:17
  • Ah, if you do want to stream data, it gets a little more complicated... – li.davidm Jan 05 '23 at 19:18
  • I don't necessarily need streaming. I was looking at streaming because you mentioned the arrow stream writer. Serializing to bytesio would be fine for now. – Brian Jan 05 '23 at 20:08
  • It's a little confusing but the 'stream' in 'stream writer' refers to the _streaming IPC format_. You can use it to serialize to a BytesIO. Just pass a BytesIO as the 'file' to the writer. (The streaming format doesn't have a 'footer' for random access. The file format does. That's basically it.) – li.davidm Jan 05 '23 at 20:11
  • I've been trying to get this to work and think I have the Python side working. Now the JS side is problematic. I'm not sure that I can get the Arrow JS code working with the old version of Angular I'm using (v11) due to problems importing from ES modules. Thanks for your help on this, but I think I've hit my time limit for this issue for now. – Brian Jan 05 '23 at 23:51

1 Answer


Building on the comments on your original post by David Li, you can implement a non-streaming version of what you want without much code, using PyArrow on the server side and the Apache Arrow JS bindings on the client. The Arrow IPC format satisfies your requirements because it ships the schema with the data, is space-efficient and zero-copy, and is cross-platform.

Here's a toy example that generates a record batch on the server and receives it on the client:

Server:

from io import BytesIO

from flask import Flask, send_file
from flask_cors import CORS
import pyarrow as pa

app = Flask(__name__)
CORS(app)

@app.get("/data")
def data():
    data = [
        pa.array([1, 2, 3, 4]),
        pa.array(['foo', 'bar', 'baz', None]),
        pa.array([True, None, False, True])
    ]
    batch = pa.record_batch(data, names=['f0', 'f1', 'f2'])

    sink = pa.BufferOutputStream()

    with pa.ipc.new_stream(sink, batch.schema) as writer:
        writer.write_batch(batch)

    return send_file(BytesIO(sink.getvalue().to_pybytes()), "data.arrow")

Client:

import { tableFromIPC } from "apache-arrow";

const table = await tableFromIPC(fetch(URL));
// Do what you like with your data

Edit: I added a runnable example at https://github.com/amoeba/arrow-python-js-ipc-example.

amoeba
  • Why is there so much boilerplate? If I understand it correctly, `data` is missing the column names, which is fixed in `batch = ...`. Then I'm unsure what the BufferOutputStream is doing, and then the new_stream, and then everything is converted to BytesIO and finally sent? It also seems like this would always be the same; why can't I just `send(batch)`? I assume the boilerplate must exist for a reason. – Felix B. May 04 '23 at 09:59