I have a desktop application where the majority of calculations (>90%) happen on the Rust side. But I want the user to be able to write Python scripts that operate on the DataFrame.

Can this be done without serializing the dataframe between runtimes to a file?

A simple invocation could be this:

Rust: agg -> Rust: calculate new column -> Python: groupby -> Rust: count results

The serializing approach works for small datasets, but it doesn't really scale to larger ones. Ideally, I would be able to tell the Python side: here is a lazy DataFrame in memory, you can manipulate it.

I've read the documentation and the only solution I could see is to use Apache IPC.

mainrs

1 Answer


LazyFrames are (mostly) serializable to JSON. The serialize operation is fallible, so depending on the type of query, it may fail to serialize. Each language has a slightly different API for this operation, though.

This serializes the LogicalPlan itself and doesn't perform any computation on the DataFrames, which makes the operation very inexpensive.

Python

import polars as pl

lf = get_lf_somehow()
# Serialize only the logical plan (not the data) to a JSON string
json_buf = lf.write_json(to_string=True)
# Rebuild a LazyFrame from the serialized plan
lf = pl.LazyFrame.from_json(json_buf)

Node.js

let lf = get_lf_somehow();
// Serialize the logical plan to a JSON buffer
const json_buf = lf.serialize('json');
lf = pl.LazyFrame.deserialize(json_buf, 'json');

Rust

// Requires the "serde" feature flag on the polars crate
let lf = get_lf_somehow();
let json_buf = serde_json::to_vec(&lf)?;
let lf: LazyFrame = serde_json::from_slice(&json_buf)?;
Cory Grinstead

    When deserializing, how do you "restore the connection" between the lazyframe and the actual data? – creanion Jun 15 '22 at 19:44
  • 1
    not sure I understand what you mean by "restore the connection". Are you referring to when `get_lf_somehow` would be constructed from a literal dataframe via `df.lazy()`? – Cory Grinstead Jun 15 '22 at 20:44
  • yes that's what I mean – creanion Jun 16 '22 at 19:48
  • when serializing a lazyframe that has a literal dataframe as the source, it will serialize the entire dataframe, which may not be practical for large datasets. If you want to pass the literal dataframe across boundaries, IPC will be the most performant. – Cory Grinstead Jun 16 '22 at 19:55
  • Does polars provide some abstraction over this case? Sharing the logical plan alongside the serialized dataframe over IPC? If not, how would one do that exactly? – mainrs Jun 21 '22 at 11:29
  • 1
    I think that would be the way to go. you would need to create some abstraction that created the lazyframe separately from the dataframe. it may also be possible to use the arrow metadata spec to directly pass the entire body as IPC. I'm not quite sure how flexible it is though. Ideally one *should* be able to serialize directly to IPC just like JSON or Bincode. – Cory Grinstead Jun 21 '22 at 13:52
  • Is it possible to ever have the possibility to serialize lambdas? Since languages have different ways of implementing these it seems impossible. A way of handing over the calculation seems really out-of-scope of polars though. – mainrs Jul 26 '22 at 16:47
  • did you ever find a solution for this? – ffff Aug 14 '23 at 11:03