
Unlike every other question I can find, I do not want to create a DataFrame from a homogeneous Numpy array, nor do I want to convert a structured array into a DataFrame.

What I want is to create a DataFrame from individual 1D Numpy arrays, one per column. I tried the obvious `DataFrame({"col1": nparray1, "col2": nparray2})`, but this constructor shows up at the top of my profile, so it must be doing something really slow.

It is my understanding that Pandas DataFrames are implemented in pure Python, where each column is backed by a Numpy array, so I would think there is an efficient way to do it.

What I'm actually trying to do is to fill a DataFrame efficiently from Cython. Cython has memoryviews that allow efficient access to Numpy arrays. So my strategy is to allocate a Numpy array, fill it with data and then put it in a DataFrame.

The opposite works quite fine, creating a memoryview from a Pandas DataFrame. So if there is a way to preallocate the entire DataFrame and then just pass the columns to Cython, this is also acceptable.

cdef int32_t[:] data_in = df['data_in'].to_numpy(dtype="int32")

A section of the profile of my code looks like this, where everything the code does is completely dwarfed by creating the DataFrame at the end.

         1100546 function calls (1086282 primitive calls) in 4.345 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    4.345    4.345 profile:0(<code object <module> at 0x7f4e693d1c90, file "test.py", line 1>)
    445/1    0.029    0.000    4.344    4.344 :0(exec)
        1    0.006    0.006    4.344    4.344 test.py:1(<module>)
     1000    0.029    0.000    2.678    0.003 :0(run_df)
     1001    0.017    0.000    2.551    0.003 frame.py:378(__init__)
     1001    0.018    0.000    2.522    0.003 construction.py:170(init_dict)

Corresponding code:

def run_df(self, df):
    cdef int arx_rows = len(df)
    cdef int arx_idx

    cdef int32_t[:] data_in = df['data_in'].to_numpy(dtype="int32")

    data_out_np = np.zeros(arx_rows, dtype="int32")
    cdef int32_t[:] data_out = data_out_np

    for arx_idx in range(arx_rows):
        self.cpp_sec_par.run(data_in[arx_idx], data_out[arx_idx])

    return pd.DataFrame({
        'data_out': data_out_np,
    })
ntg
Pepijn
  • Are you sure that it's possible for a dataframe to operate with a "ragged" array of 1D numpy arrays? If it uses a 2D array under the hood, I don't think you're going to get around copying the arrays. – user545424 Mar 04 '19 at 17:33
  • What's a "ragged" array? It can't use a 2D array under the hood because dataframes are heterogeneous while numpy arrays are homogeneous. – Pepijn Mar 05 '19 at 07:49
  • A ragged array usually refers to an array of arrays of different lengths, but in this case I just meant a bunch of different 1D arrays which are not stored as a single 2D array. Numpy can also store heterogeneous 2D arrays but they are still stored as a single 2D array. You can think of these like an array of structs in C. – user545424 Mar 05 '19 at 18:14
  • Can you pre-build the `DataFrame`, i.e. with an `index` and `columns` to meet your needs and a fill value (e.g. `NaN`), then iterate through the arrays and put them in the correct place? – blalterman May 04 '19 at 05:29
  • @blalterman Can I? I know the length and column types of the DataFrame, but there does not appear to be an API for pre-allocating DataFrames. – Pepijn May 05 '19 at 06:13
  • @Pepijn What about `df = pd.DataFrame(fill, index=idx, columns=cols)`, where you replace `idx` and `cols` with the appropriate index and columns? I also suggest a `fill` value appropriate to your data. I've seen `-9999.0` used, `1e30`, etc. Just make sure the value is consistent so that you can check for it and make sure you visited all locations in the DataFrame. I tend to avoid using `NaN` in this situation b/c I can always `df.replace(fill, np.nan)` once I've loaded everything. – blalterman May 05 '19 at 21:55
  • @blalterman The problem is my data is heterogeneous, so just using a numpy array for fill won't work. One thing I have not tried that might work: maybe it can use a numpy structured array without copying. – Pepijn May 06 '19 at 08:52
  • @Pepijn Is your data within each column a single `dtype`? If so, then just create each column as a `pd.Series` with its own fill value and combine them, or cast the `pd.DataFrame` columns to the appropriate `dtypes` after creating it with `fill`. – blalterman May 07 '19 at 11:37
  • @blalterman sure, but the goal is to do so without copying. All of your solutions would copy the data into the dataframe. – Pepijn May 08 '19 at 09:25
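A minimal sketch of the preallocation idea from the comments. The sentinel fill value, index, and column names here are illustrative, and whether `to_numpy()` returns a view into the frame (rather than a copy) depends on the pandas version and its copy-on-write settings, so this only avoids a copy on some versions:

```python
import numpy as np
import pandas as pd

n = 5
cols = ["data_in", "data_out"]

# Preallocate a homogeneous int32 frame with a sentinel fill value.
df = pd.DataFrame(-9999, index=range(n), columns=cols, dtype="int32")

# Try to write into the column buffer in place; this only skips a copy
# if to_numpy() hands back a view, which is version-dependent.
buf = df["data_out"].to_numpy()
buf[:] = np.arange(n, dtype="int32")
```

In a Cython context the `buf` array could then be wrapped in an `int32_t[:]` memoryview, as in the question's own `data_in` line.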

3 Answers

pandas.DataFrame({"col1": nparray1, "col2": nparray2})

This works if you pass list(nparray) for each column instead. Here's a generic example:

import numpy as np
import pandas as pd

alpha = np.array([1, 2, 3])
beta = np.array([4, 5, 6])
gamma = np.array([7, 8, 9])

dikt = {"Alpha": list(alpha), "Beta": list(beta), "Gamma": list(gamma)}

data_frame = pd.DataFrame(dikt)
print(data_frame)
Nepumuk
ats

I don't think this fully answers the question, but it might help.

1. When you initialize your DataFrame directly from a 2D array, a copy is not made.

2. You don't have 2D arrays, you have 1D arrays; how to get a 2D array from 1D arrays without making copies, I don't know.

To illustrate the points, see below:

a = np.array([1,2,3])
b = np.array([4,5,6])
c = np.array((a,b))
df = pd.DataFrame(c)

print(c)
[[1 2 3]
 [4 5 6]]

print(df)
   0  1  2
0  1  2  3
1  4  5  6

c[1,1]=10
print(df)
   0   1  2
0  1   2  3
1  4  10  6

So, changing c indeed changes df. However, if you try changing a or b, that does not affect c (or df), because np.array((a, b)) already copied the data.
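One way to test the no-copy claim directly is np.shares_memory; note the result depends on the pandas version and its copy-on-write settings, so treat the printed value as a diagnostic rather than a guarantee:

```python
import numpy as np
import pandas as pd

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Stacking the 1D arrays into a 2D array copies them once, up front.
c = np.vstack((a, b))

df = pd.DataFrame(c, copy=False)

# Diagnostic: does the frame reuse c's buffer? (version-dependent)
print(np.shares_memory(c, df.to_numpy()))
```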

user2677285
  • I've upvoted this because I think it's right (dataframes are backed by a 2D array so of course you can't build it from 1D arrays without copying) and because it's the only answer that attempts to address the question. – DavidW Oct 27 '19 at 09:01

May I suggest adding the columns one by one; it might help with efficiency. For example:

import numpy as np
import pandas as pd

df = pd.DataFrame()

col1 = np.array([1, 2, 3])
col2 = np.array([4, 5, 6])

df['col1'] = col1
df['col2'] = col2
>>> df
   col1  col2
0     1     4
1     2     5
2     3     6
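A caveat worth checking (my own sketch, not part of the answer above): in the pandas versions I have used, column assignment copies the source array into the frame, so this route avoids the dict-constructor overhead but not the copy itself. Mutating the source array afterwards makes the copy visible:

```python
import numpy as np
import pandas as pd

col1 = np.array([1, 2, 3])
df = pd.DataFrame()
df["col1"] = col1

# If assignment copied, mutating the source does not change the frame.
col1[0] = 99
print(df["col1"].tolist())
```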
Arash