3

I have a numpy array whose number of rows (axis=0) is the same as a pandas dataframe's number of rows.

I want to create a new column in the dataframe, in which each entry is a numpy array of lower dimension.

Code:

    some_df = pd.DataFrame(columns=['A'])
    for i in range(10):
        some_df.loc[i] = [np.random.rand(4, 6, 8)]

    data = np.stack(some_df['A'].values)  #shape (10, 4, 6, 8)
    processed = np.max(data, axis=1)  # shape (10, 6, 8)

    some_df['B'] = processed  # This fails

I want the new column 'B' to contain numpy arrays of shape (6, 8)

How can this be done?

cs95
Gulzar
  • This is not your first question about storing NumPy arrays inside DataFrame cells. In https://stackoverflow.com/questions/56617760/ I recently suggested this is the wrong approach. It's still the wrong approach. – John Zwinck Jun 16 '19 at 10:50
  • @JohnZwinck Even if it is the wrong approach, this is still what I would like to do right now. – Gulzar Jun 16 '19 at 10:56
  • The reason being I would like something quick which does not require me to debug indexing. I simply can't afford it right now. – Gulzar Jun 16 '19 at 10:57
  • I agree with the comment above that it is not recommended, but maybe `some_df['B'] = [x for x in processed]` should work. – jezrael Jun 16 '19 at 10:58
  • Or `some_df['B'] = processed.tolist()` – jezrael Jun 16 '19 at 11:00
  • @JohnZwinck You were oh-so-right. I sweated blood due to this. Here is a better way to go. https://stackoverflow.com/a/67180581/913098 – Gulzar Apr 20 '21 at 14:11

3 Answers

5

This is not recommended: it is painful, slow, and makes later processing harder.

One possible solution is to use a list comprehension:

some_df['B'] = [x for x in processed]

Or convert to a list and assign:

some_df['B'] = processed.tolist()
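
A quick check of what lands in the cells (a sketch, assuming `some_df` and `processed` from the question are in scope; note that the `tolist()` variant stores nested Python lists rather than ndarrays):

some_df['B'] = [x for x in processed]      # each cell holds a (6, 8) ndarray
print(some_df.loc[0, 'B'].shape)           # (6, 8)

# stack the column back into a single array when needed
restored = np.stack(some_df['B'].values)   # shape (10, 6, 8)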
Gulzar
jezrael
1

Coming back to this after 2 years, here is a much better practice:

from itertools import product, chain
import pandas as pd
import numpy as np
from typing import Dict


def calc_col_names(named_shape):
    *prefix, shape = named_shape
    names = [map(str, range(i)) for i in shape]
    return map('_'.join, product(prefix, *names))


def create_flat_columns_df_from_dict_of_numpy(
        named_np: Dict[str, np.ndarray],
        n_samples_per_np: int,
):
    named_np_correct_length = {k: v for k, v in named_np.items() if len(v) == n_samples_per_np}
    flat_nps = [a.reshape(n_samples_per_np, -1) for a in named_np_correct_length.values()]
    stacked_nps = np.column_stack(flat_nps)
    named_shapes = [(name, arr.shape[1:]) for name, arr in named_np_correct_length.items()]
    col_names = [*chain.from_iterable(calc_col_names(named_shape) for named_shape in named_shapes)]
    df = pd.DataFrame(stacked_nps, columns=col_names)
    df = df.convert_dtypes()
    return df


def parse_series_into_np(df, col_name, shp):
    # can parse the shape from the col names
    n_samples = len(df)
    col_names = sorted(c for c in df.columns if col_name in c)
    col_names = list(filter(lambda c: c.startswith(col_name + "_") or len(col_names) == 1, col_names))
    col_as_np = df[col_names].astype(float).values.reshape((n_samples, *shp))
    return col_as_np

Usage, to put an ndarray into a DataFrame:

full_rate_df = create_flat_columns_df_from_dict_of_numpy(
    named_np={name: np.array(d[name]) for name in ["name1", "name2"]},
    n_samples_per_np=d["name1"].shape[0]
)

where `d` is a dict of ndarrays sharing the same `shape[0]`, keyed by `["name1", "name2"]`.

The reverse operation can be obtained by parse_series_into_np.
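
For illustration, a minimal round-trip sketch (the dict `d` and its shapes below are made up for the example):

d = {"name1": np.random.rand(10, 6, 8), "name2": np.random.rand(10, 3)}

flat_df = create_flat_columns_df_from_dict_of_numpy(
    named_np={name: np.array(d[name]) for name in ["name1", "name2"]},
    n_samples_per_np=d["name1"].shape[0],
)
print(flat_df.shape)  # (10, 51): one scalar column per array element

# recover "name1" as an ndarray of shape (10, 6, 8)
name1_back = parse_series_into_np(flat_df, "name1", shp=(6, 8))
print(np.allclose(name1_back, d["name1"]))  # expected: True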


The accepted answer remains, as it answers the original question, but this one is a much better practice.

Gulzar
0

I know this question already has an answer, but I would like to add a much more scalable way of doing this. As mentioned in the comments above, it is in general not recommended to store arrays as "field" values in a pandas DataFrame column (I actually do not know why). Nevertheless, in my day-to-day work this is an extremely important functionality when working with time-series data and a bunch of related meta-data. In general I organize my experimental time series as pandas DataFrames with one column holding same-length numpy arrays and the other columns containing meta-data about measurement conditions etc.

The proposed solution by jezrael works very well, and I have used it regularly for the last 4 years. But this method can run into huge memory problems. In my case I came across these problems working with DataFrames of more than 5 million rows and time series with approx. 100 data points each.

The solution to these problems is extremely simple, and since I did not find it anywhere I just wanted to share it here: simply transform your 2D array into a pandas Series object and assign this to a column of your DataFrame:

df["new_list_column"] = pd.Series(list(numpy_array_2D))
Patrick
  • I did this once, and never again. The reason (you asked why) is this makes you lose all pandas functionality, as numeric types are expected, and this forces object types. What I do now is split the array to multiple columns, and have the 1st dimension as the index in pandas. This retains pandas functionality, and can later be aggregated back into a numpy ndarray by smart naming of columns. – Gulzar Apr 20 '21 at 13:58
  • This also forces a `np.concatenate()` call every time you want to access the data, and doesn't let you plot, see errors, use PyCharm's "view as dataframe", or anything else pandas offers. For your sanity - don't do this. – Gulzar Apr 20 '21 at 13:59