
It seems that I can memory-map the underlying data for a pandas Series by creating an mmap'd ndarray and using it to initialize the Series.

        def assert_readonly(iloc):
            try:
                iloc[0] = 999  # Should be non-editable
                raise Exception("MUST BE READ ONLY (1)")
            except ValueError as e:
                assert "read-only" in str(e)

        # Original ndarray
        n = 1000
        _arr = np.arange(0, n, dtype=float)

        # Convert it to a memmap
        mm = np.memmap(filename, mode='w+', shape=_arr.shape, dtype=_arr.dtype)
        mm[:] = _arr[:]
        del _arr
        mm.flush()
        mm.flags['WRITEABLE'] = False  # Make immutable!

        # Wrap as a series
        s = pd.Series(mm, name="a")
        assert_readonly(s.iloc)

Success! It seems that s is backed by a read-only mem-mapped ndarray. Can I do the same for a DataFrame? The following fails:

        df = pd.DataFrame(s, copy=False, columns=['a'])
        assert_readonly(df["a"]) # Fails

The following succeeds, but only for one column:

        df = pd.DataFrame(mm.reshape((len(mm), 1)), columns=['a'], copy=False)
        assert_readonly(df["a"]) # Succeeds

... so I can make a DataFrame without copying. However, this only works for one column, and I want many. The methods I've found for combining single-column DataFrames (pd.concat(copy=False), pd.merge(copy=False), ...) all result in copies.
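For example, here is a quick check (a sketch reusing mm and assert_readonly from above; exact behavior may vary by pandas version) that concatenating two single-column DataFrames hands back a fresh, writeable copy even with copy=False:

        df1 = pd.DataFrame(mm.reshape((len(mm), 1)), columns=['a'], copy=False)
        df2 = pd.DataFrame(mm.reshape((len(mm), 1)), columns=['b'], copy=False)
        both = pd.concat([df1, df2], axis=1, copy=False)
        assert_readonly(both["a"].iloc)  # Fails: the concatenated data is no longer the read-only memmap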

I have thousands of large columns stored as data files, of which I only ever need a few at a time. I was hoping I'd be able to place their mmap'd representations in a DataFrame as above. Is that possible?

Pandas documentation makes it a little difficult to guess what's going on under the hood here, although it does say a DataFrame "Can be thought of as a dict-like container for Series objects." I'm beginning to think this is no longer the case.

I'd prefer not to need HDF5 to solve this.

user48956

2 Answers


OK... after a lot of digging here's what's going on.

While pandas maintains a reference to the supplied array for a Series when copy=False is passed to the constructor:

import pandas as pd
import numpy as np

a = np.array([1, 2, 3])  # Let's imagine this is our memmap'd array
s = pd.Series(data=a, copy=False)
assert s.to_numpy() is a  # Yes!

It does not for a DataFrame:

coldict = dict(col1=a,
               # col2=np.array([1.1, 2.2, 3.3]),  # See below
               # col3=np.array([11, 12, 13])
               )
df = pd.DataFrame(data=coldict, copy=False)
assert df["col1"].to_numpy() is a      # Nope! Not even for pandas >=1.3
assert df["col1"].values is a          # Nope!

Pandas' DataFrame uses the BlockManager class to organize the data internally. Contrary to the docs, a DataFrame is NOT a collection of Series but a collection of similarly-dtyped matrices. BlockManager groups all the float columns together, all the int columns together, etc., and their memory (from what I can tell) is kept together.
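You can see the consolidation by poking at the internals (a sketch; the manager attribute is _mgr in recent pandas and _data in older releases, so this is version-dependent):

blocks_df = pd.DataFrame({"f1": np.arange(3, dtype=float),
                          "f2": np.arange(3, dtype=float),
                          "i1": np.arange(3, dtype=int)})
for block in blocks_df._mgr.blocks:   # one 2x3 float64 block and one 1x3 int64 block
    print(block.dtype, block.values.shape)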

It can do that without copying the memory ONLY if a single ndarray matrix (a single dtype) is provided. Note that BlockManager could, in theory, also hold mixed-type data without copying, since it is not strictly necessary to consolidate the input into same-typed chunks. In practice, however, the DataFrame constructor skips the copy only when a single matrix is passed as the data parameter.
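A quick way to convince yourself of the single-matrix case (a sketch; whether memory is actually shared can depend on the pandas version and Copy-on-Write settings):

mat = np.arange(12, dtype=float).reshape(4, 3)   # a single same-dtype matrix
df_one = pd.DataFrame(mat, columns=["a", "b", "c"], copy=False)
mat[0, 0] = 123.0                                # mutate the source array in place...
print(df_one.iloc[0, 0])                         # ...and, if no copy was made, the DataFrame sees 123.0
print(np.shares_memory(mat, df_one["a"].to_numpy()))  # True when the memory is shared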

In short, if you have mixed types or multiple arrays as input to the constructor, or provide a dict of arrays (even a dict containing just a single array), you are out of luck: pandas' DataFrame with its default BlockManager will copy your data.

In any case, one way to work around this is to force BlockManager not to consolidate by type, but to keep each column as a separate 'block'. So, with a bit of subclassing magic...

from pandas.core.internals import BlockManager

class BlockManagerUnconsolidated(BlockManager):
    """A BlockManager that never consolidates blocks by dtype."""
    def __init__(self, *args, **kwargs):
        BlockManager.__init__(self, *args, **kwargs)
        self._is_consolidated = False
        self._known_consolidated = False

    def _consolidate_inplace(self):  # never merge blocks in place
        pass

    def _consolidate(self):  # report the blocks unchanged
        return self.blocks


def df_from_arrays(arrays, columns, index):
    """Build a DataFrame whose columns reference (not copy) the given 1-D arrays.

    NOTE: this leans on private pandas internals and may break on newer versions.
    """
    from pandas.core.internals import make_block

    def gen():
        _len = None
        for p, a in enumerate(arrays):
            if _len is None:
                _len = len(a)
                assert len(index) == _len
            assert _len == len(a)
            # reshape returns a view, so the block still points at the original memory
            yield make_block(values=a.reshape((1, _len)), placement=(p,))

    blocks = tuple(gen())
    mgr = BlockManagerUnconsolidated(blocks=blocks, axes=[columns, index])
    return pd.DataFrame(mgr, copy=False)

It would be better if DataFrame or BlockManager had a consolidate=False option (or assumed this behavior) when copy=False is specified.

To test:

def assert_readonly(iloc):
    try:
        iloc[0] = 999  # Should be non-editable
        raise Exception("MUST BE READ ONLY (1)")
    except ValueError as e:
        assert "read-only" in str(e)

# Original ndarray
n = 1000
_arr = np.arange(0, n, dtype=float)

# Convert it to a memmap
mm = np.memmap(filename, mode='w+', shape=_arr.shape, dtype=_arr.dtype)
mm[:] = _arr[:]
del _arr
mm.flush()
mm.flags['WRITEABLE'] = False  # Make immutable!

df = df_from_arrays(
    [mm, mm, mm],
    columns=['a', 'b', 'c'],
    index=range(len(mm)))
assert_read_only(df["a"].iloc)
assert_read_only(df["b"].iloc)
assert_read_only(df["c"].iloc)

It seems a little questionable to me whether there are really practical benefits to BlockManager requiring similarly-typed data to be kept together -- most of the operations in pandas are label-row-wise or per column -- which follows from a DataFrame being a structure of heterogeneous columns that are usually only associated by their index. Feasibly they keep one index per 'block', gaining a benefit if the index stores offsets into the block (if that were the case, then they should group by sizeof(dtype), which I don't think is the case).
Ho hum...

There was some discussion about a PR to provide a non-copying constructor, which was abandoned.

It looks like there are sensible plans to phase out BlockManager, so your mileage may vary.

Also see Pandas under the hood, which helped me a lot.

Michel de Ruiter
user48956

If you change your DataFrame constructor to add the parameter copy=False you will have the behavior you want. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html

Edit: Also, you want to use the underlying ndarray (rather than the pandas series).

dllahr
  • Hmmm... I was sure this was the right answer. It turns out that copy=False is the default parameter value, and doesn't fix this problem (writing to s2.iloc[0] doesn't modify the original data). I wonder if the behavior has changed. – user48956 Nov 30 '17 at 01:05
  • the copy=False parameter only applies to DataFrames and ndarrays - the s you're initializing with is a Series. That may be an edge case. Perhaps you should initialize with mm or the matrix-equivalent memmap into your dataframe? – dllahr Nov 30 '17 at 14:14
  • Yes - this works: for a single column I can make a DF from an nx1 array. However, I want many columns. My original data is paged in from many files (one per column). Numpy doesn't seem to support vstacking without copying, so I can't initialize the DF from a matrix this way. It seems a DF *can* reference without copying - but oddly not for multiple columns (?). – user48956 Dec 19 '17 at 17:51
  • 1
    I think this answers the question of why you can't concatenated w/o copying: https://stackoverflow.com/questions/7869095/concatenate-numpy-arrays-without-copying – dllahr Dec 20 '17 at 01:17
  • It recommends that you initialize an empty array of the correct size, and then load each individual file into one of the columns of this array. Then you can back a DataFrame with that array (see the sketch after these comments). – dllahr Dec 20 '17 at 01:18
  • 2
    I see. Another option then is to implement a class that implements the numpy array interface, it has a list of your columns which are the memmaps are each file, and handles access of them. This should then be able to back the DataFame. Alternatively, you re-write your data to be one single file and then memmap that. – dllahr Dec 20 '17 at 05:54
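A minimal sketch of that preallocate-and-fill suggestion (the file names, dtype and shapes below are hypothetical placeholders):

    import numpy as np
    import pandas as pd

    n_rows = 1000
    col_files = ["col_a.dat", "col_b.dat", "col_c.dat"]  # hypothetical per-column data files

    # Preallocate one combined memmap and copy each column file into it once.
    combined = np.memmap("combined.dat", mode="w+", dtype=float, shape=(n_rows, len(col_files)))
    for j, path in enumerate(col_files):
        combined[:, j] = np.memmap(path, mode="r", dtype=float, shape=(n_rows,))
    combined.flush()

    # A single same-dtype matrix can back a DataFrame without a further copy.
    df = pd.DataFrame(combined, columns=["a", "b", "c"], copy=False)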