
I was very surprised by the timings of creating DataFrames in this question:

import pandas as pd

# [30000 rows x 2 columns]
df = pd.concat([pd.DataFrame({'fruits': ['apples', 'grapes', 'figs'],
                              'numFruits': [10, 20, 15]})] * 10000
               ).reset_index(drop=True)
#print (df)


In [55]: %timeit (pd.DataFrame([df.numFruits.values], ['Market 1 Order'], df.fruits.values))
1 loop, best of 3: 2.4 s per loop

In [56]: %timeit (pd.DataFrame(df.numFruits.values.reshape(1,-1), index=['Market 1 Order'], columns=df.fruits.values))
The slowest run took 5.64 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 424 µs per loop

What is the reason?

Why is there such a huge difference between passing numpy.ndarray.reshape and wrapping the array in a list ([])?

jezrael
  • I think the main difference here is that when passing `.values.reshape` the shape and dtype are already compatible with pandas, so it can just take a view on the underlying memory without any allocation, whilst for the list type it has to detect the shape, then infer compatible dtypes and copy the values to newly allocated memory – EdChum Jan 25 '17 at 22:09
  • @EdChum - thank you, that could be an answer... – jezrael Jan 25 '17 at 22:10
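
A quick way to check EdChum's explanation empirically is to ask numpy whether the constructed frame shares memory with the source array. This is a minimal sketch; the view behaviour assumes classic, pre-copy-on-write pandas (the vintage in this question):

import numpy as np
import pandas as pd

arr = np.arange(6)

# reshape path: the 2-D ndarray already has a pandas-compatible shape/dtype,
# so the frame can hold a view on the original memory
fast = pd.DataFrame(arr.reshape(1, -1))
print(np.shares_memory(fast.values, arr))   # True here (no copy was made)

# list path: pandas must detect the shape, infer dtypes and copy the values
slow = pd.DataFrame([arr])
print(np.shares_memory(slow.values, arr))   # False (the data was copied)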

2 Answers


After some painful debugging I can confirm the path that the slow version takes in the DataFrame ctor:

elif isinstance(data, (list, types.GeneratorType)):
    if isinstance(data, types.GeneratorType):
        data = list(data)
    if len(data) > 0:
        if is_list_like(data[0]) and getattr(data[0], 'ndim', 1) == 1:
            if is_named_tuple(data[0]) and columns is None:
                columns = data[0]._fields
            arrays, columns = _to_arrays(data, columns, dtype=dtype)

Here it tests the type of the passed data; as it's list-like, it then tests each element for its type. It isn't expecting a list containing an np array, so it ends up here:

def _to_arrays(data, columns, coerce_float=False, dtype=None):
    """
    Return list of arrays, columns
    """
    if isinstance(data, DataFrame):
        if columns is not None:
            arrays = [data._ixs(i, axis=1).values
                      for i, col in enumerate(data.columns) if col in columns]
        else:
            columns = data.columns
            arrays = [data._ixs(i, axis=1).values for i in range(len(columns))]

        return arrays, columns

    if not len(data):
        if isinstance(data, np.ndarray):
            columns = data.dtype.names
            if columns is not None:
                return [[]] * len(columns), columns
        return [], []  # columns if columns is not None else []
    if isinstance(data[0], (list, tuple)):
        return _list_to_arrays(data, columns, coerce_float=coerce_float,
                               dtype=dtype)

then here:

def _list_to_arrays(data, columns, coerce_float=False, dtype=None):
    if len(data) > 0 and isinstance(data[0], tuple):
        content = list(lib.to_object_array_tuples(data).T)
    else:
        # list of lists
        content = list(lib.to_object_array(data).T)
    return _convert_object_array(content, columns, dtype=dtype,
                                 coerce_float=coerce_float)

and finally here:

def _convert_object_array(content, columns, coerce_float=False, dtype=None):
    if columns is None:
        columns = _default_index(len(content))
    else:
        if len(columns) != len(content):  # pragma: no cover
            # caller's responsibility to check for this...
            raise AssertionError('%d columns passed, passed data had %s '
                                 'columns' % (len(columns), len(content)))

    # provide soft conversion of object dtypes
    def convert(arr):
        if dtype != object and dtype != np.object:
            arr = lib.maybe_convert_objects(arr, try_float=coerce_float)
            arr = _possibly_cast_to_datetime(arr, dtype)
        return arr

    arrays = [convert(arr) for arr in content]

    return arrays, columns

You can see that there is no optimisation in the construction it performs; it essentially just iterates through every element, converts it (which copies it) and returns a list of arrays.
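
For intuition, here is roughly what that path amounts to in plain numpy. This is an approximation of `lib.to_object_array(data).T` plus the per-column `convert`, not the exact pandas internals:

import numpy as np

data = [np.arange(30000)]            # what pd.DataFrame([values], ...) receives

# everything is first boxed into an object-dtype array, copying each element
obj = np.empty((len(data), len(data[0])), dtype=object)
for i, row in enumerate(data):
    obj[i] = row

# the transpose then yields 30000 one-element object columns, and each one
# is soft-converted back to a concrete dtype -- Python-level work per column
columns = [col.astype(np.int64) for col in obj.T]
print(len(columns))                  # 30000 separate conversions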

For the other path, as the np array's shape and dtype are already pandas-friendly, it can take a view on the data (or copy only if required); it already knows enough to optimise the construction.
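
The gap is easy to reproduce outside IPython with the stdlib. A minimal sketch; absolute numbers will vary with the pandas version and machine:

import timeit
import numpy as np
import pandas as pd

vals = np.random.randint(0, 100, size=30000)

t_list = timeit.timeit(lambda: pd.DataFrame([vals]), number=5) / 5
t_view = timeit.timeit(lambda: pd.DataFrame(vals.reshape(1, -1)), number=5) / 5
print('list wrapper: %.6f s per call' % t_list)
print('reshape:      %.6f s per call' % t_view)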

EdChum
  • Thank you for your answer, now I understand it better. I will wait a few hours and then accept the best answer; for now it is yours. – jezrael Jan 25 '17 at 22:40
  • Looking at the code path, it could test whether the first element is an np array and, if so, take a fast path to optimise the construction, but it's not expecting this; really, testing whether each list or tuple element is an ndarray adds complexity, which is why it doesn't check for this – EdChum Jan 25 '17 at 22:42
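
To make that comment concrete, a hypothetical pre-check along those lines might look like the sketch below; `maybe_fast_path` is an invented name for illustration, not actual pandas code:

import numpy as np

def maybe_fast_path(data):
    # hypothetical: if every element of the list is already a 1-D ndarray of
    # the same length, stack them at C level so the optimised ndarray branch
    # could take over instead of per-element object conversion
    if (isinstance(data, list) and data
            and all(isinstance(el, np.ndarray) and el.ndim == 1 for el in data)
            and len({el.shape[0] for el in data}) == 1):
        return np.vstack(data)   # one C-level copy, no per-element inference
    return data                  # otherwise keep the existing list handling

print(maybe_fast_path([np.arange(3), np.arange(3)]).shape)   # (2, 3)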

@EdChum's comment is on point.

Just looking at how pandas handles list data vs array data, you will quickly see that passing a list is more complicated.

array:

elif isinstance(data, (np.ndarray, Series, Index)):
    if data.dtype.names:
        data_columns = list(data.dtype.names)
        data = dict((k, data[k]) for k in data_columns)
        if columns is None:
            columns = data_columns
        mgr = self._init_dict(data, index, columns, dtype=dtype)
    elif getattr(data, 'name', None):
        mgr = self._init_dict({data.name: data}, index, columns,
                              dtype=dtype)
    else:
        mgr = self._init_ndarray(data, index, columns, dtype=dtype,
                                 copy=copy)

now if it's a list:

elif isinstance(data, (list, types.GeneratorType)):
    if isinstance(data, types.GeneratorType):
        data = list(data)
    if len(data) > 0:
        if is_list_like(data[0]) and getattr(data[0], 'ndim', 1) == 1:
            if is_named_tuple(data[0]) and columns is None:
                columns = data[0]._fields
            arrays, columns = _to_arrays(data, columns, dtype=dtype)
            columns = _ensure_index(columns)

            # set the index
            if index is None:
                if isinstance(data[0], Series):
                    index = _get_names_from_index(data)
                elif isinstance(data[0], Categorical):
                    index = _default_index(len(data[0]))
                else:
                    index = _default_index(len(data))

            mgr = _arrays_to_mgr(arrays, columns, index, columns,
                                 dtype=dtype)
        else:
            mgr = self._init_ndarray(data, index, columns, dtype=dtype,
                                     copy=copy)
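
Which branch fires is decided purely by the outer type of `data`, which is easy to verify (a minimal demo):

import numpy as np

vals = np.arange(5)

print(type(vals.reshape(1, -1)))   # <class 'numpy.ndarray'> -> ndarray branch
print(type([vals]))                # <class 'list'>          -> list branch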
Steven G
  • Hmmm, but I pass numpy arrays in both ways - or am I missing something? – jezrael Jan 25 '17 at 22:30
  • You pass a numpy array inside a list when doing `[df.numFruits.values]`; you can check that `type([df.numFruits.values])` returns `list`. So when the ctor checks `data`, it falls into the `isinstance(data, (list, types.GeneratorType))` branch of the if. – Steven G Jan 26 '17 at 14:08