
Joblib has functionality for sharing NumPy arrays across processes by automatically memmapping the array. However, this relies on NumPy-specific facilities. Pandas uses NumPy under the hood, but unless all of your columns share the same data type, you can't serialize a DataFrame to a single NumPy array.
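For context, here is a minimal sketch of the NumPy behavior I'm referring to: joblib automatically memmaps large arrays (above `max_nbytes`, default `"1M"`) before handing them to worker processes. The function and array here are just illustrative.

```python
import numpy as np
from joblib import Parallel, delayed

def column_sum(arr):
    # Inside a worker, `arr` may arrive as a read-only np.memmap
    # rather than a regular ndarray.
    return arr.sum(axis=0)

# ~6.4 MB of float64, large enough to trigger automatic memmapping
data = np.random.rand(200_000, 4)

# max_nbytes controls the memmapping threshold; "1M" is the default.
results = Parallel(n_jobs=2, max_nbytes="1M")(
    delayed(column_sum)(data) for _ in range(2)
)
```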

What would be the "right" way to cache a DataFrame for reuse in Joblib?

My best guess would be to memmap each column separately, then reconstruct the DataFrame inside the loop (and hope that Pandas doesn't copy the data). But that seems like a fairly involved process.
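To make the per-column idea concrete, here is a rough sketch of what I have in mind: dump each column to its own `.npy` file, memory-map the files back, and rebuild the DataFrame. The file layout is purely illustrative (not a joblib API), and as noted, `pd.DataFrame` may still copy when assembling mixed dtypes, so zero-copy behavior is not guaranteed.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(5, dtype="int64"),
                   "b": np.linspace(0.0, 1.0, 5)})

# Dump each column to its own .npy file.
tmpdir = tempfile.mkdtemp()
paths = {}
for col in df.columns:
    path = os.path.join(tmpdir, f"{col}.npy")
    np.save(path, df[col].to_numpy())
    paths[col] = path

# Memory-map the columns back and reconstruct the DataFrame.
mmaps = {col: np.load(path, mmap_mode="r") for col, path in paths.items()}
df2 = pd.DataFrame(mmaps)
```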

I am aware of the standalone Memory class, but it's not clear if that can help.
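For reference, this is how I understand `Memory` would be used; as far as I can tell it caches the function's return value on disk (DataFrames included), which avoids recomputation but isn't the same as sharing memory across processes. The function here is a stand-in for an expensive loading step.

```python
import tempfile

import pandas as pd
from joblib import Memory

# Memory caches function results on disk, keyed by the arguments.
cachedir = tempfile.mkdtemp()
memory = Memory(cachedir, verbose=0)

@memory.cache
def load_data(n):
    # Stand-in for an expensive loading/processing step.
    return pd.DataFrame({"x": range(n)})

df1 = load_data(3)   # computed and written to the cache
df2 = load_data(3)   # loaded back from the cache
```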

shadowtalker
  • hey, did you make any progress about this issue? – MehmedB Jul 02 '19 at 13:16
  • @MehmedB I think somewhere on Github I was told that a DataFrame is cached transparently. – shadowtalker Jul 02 '19 at 17:49
  • @shadowtalker could you add an answer with an example of how to cache a DataFrame? – Dan May 29 '20 at 21:20
  • @Dan I do not have a definitive answer. As I said in my comment, I have a faint memory of a Github issue reply stating that DataFrames are cached transparently (meaning that you don't need to take any special action), but I don't have any confirmation. – shadowtalker May 30 '20 at 01:26

0 Answers