I have a dataset stored as a NumPy array with shape (1536 x 16 x 48). A quick explanation of these dimensions that might be helpful:

  • The dataset consists of data collected by EEG sensors at a rate of 256 Hz (1 second = 256 measurements/values);
  • 1536 values represent 6 seconds of EEG data (256 * 6 = 1536);
  • 16 is the number of electrodes used to collect data;
  • 48 is the number of samples.

In summary: I have 48 samples of 6 seconds (1536 values) of EEG data, collected by 16 electrodes.

I need to create a pandas DataFrame with all this data, and therefore turn this 3D array into 2D. The depth dimension (48) can be removed if I stack all samples one above another. So the new dataset will be shaped (1536 * 48) x 16.

In addition to that, since this is a classification problem, I have a vector with 48 values that represents the class of each EEG sample. The new dataset should also have this as a "class" column, so the real shape would be (1536 * 48) x (16 + 1), where the extra column is the class.

I could easily do that by looping through the depth dimension of the 3D array and concatenating everything into a new 2D one. But this looks bad, since I will be dealing with many datasets like this one and performance is an issue. I would like to know if there is a smarter way of doing it.

I tried to provide as much information as I could for this question, but since it is not a trivial task, feel free to ask for further details if needed.

Thanks in advance.

heresthebuzz
  • Does [Efficiently Creating A Pandas DataFrame From A Numpy 3d array](https://stackoverflow.com/questions/36235180/efficiently-creating-a-pandas-dataframe-from-a-numpy-3d-array) answer your question? – wwii Jan 19 '21 at 19:54

2 Answers

For the numpy part

import numpy as np
import pandas as pd

x = np.random.random((1536, 16, 48)) # ndarray with similar shape
x = x.swapaxes(1,2) # swap axes 1 and 2, i.e. 16 and 48
x = x.reshape((-1, 16), order='C') # order is important, you may want to check the docs
c = np.zeros((x.shape[0], 1)) # class column, shape=(73728, 1)
x = np.hstack((x, c)) # final dataset
x.shape

Output

(73728, 17)

or in one line (starting from the original 3D x, with c defined as above)

x = np.hstack((x.swapaxes(1,2).reshape((-1, 16), order='C'), c))

Finally,

x = pd.DataFrame(x)
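
On the class column: with the swapaxes reshape above, consecutive rows cycle through the 48 samples for each time step (row `t * 48 + s` holds time step `t` of sample `s`), so a 48-value label vector would need to be tiled across the rows rather than replaced by zeros. A minimal sketch, assuming a hypothetical label vector `labels` of length 48 (it is not part of the original post):

```python
import numpy as np
import pandas as pd

x3d = np.random.random((1536, 16, 48))  # (time, electrode, sample)
labels = np.arange(48)                   # hypothetical class labels, one per sample

# Same reshape as above: rows come out time-major, sample-minor.
flat = x3d.swapaxes(1, 2).reshape(-1, 16)

# Row t * 48 + s belongs to sample s, so the whole label vector
# is repeated once per time step.
row_labels = np.tile(labels, x3d.shape[0])

df = pd.DataFrame(flat)
df["class"] = row_labels
```

If the samples should instead sit in contiguous blocks of 1536 rows, the axis order before the reshape has to change, and the labels would then be repeated element-wise rather than tiled.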
paul-shuvo
  • I'm trying to reproduce your code, but I get the following error: `TypeError: 'tuple' object is not callable`. Do you have any clue what it is? – heresthebuzz Jan 19 '21 at 20:35
  • Made a small typo, `c = np.zeros((x.shape(0), 1))` should've been `c = np.zeros((x.shape[0], 1))`. Fixed now. – paul-shuvo Jan 19 '21 at 21:51
  • That is a really good way of transforming the 3D array into 2D, but what about the part of concatenating the 48-length vector into the new array? In your example, you concatenated the `c` vector as a vector with `73728` values instead of `48` – heresthebuzz Jan 19 '21 at 21:57
  • According to your post, the array shape should be (1536 * 48) x 16 + 1 = 73728x17. So, 73728 samples, 16 feature columns, and a classification column. When you say concatenating the 48-length vector, which dimension are you referring to? – paul-shuvo Jan 19 '21 at 22:04

Setup

>>> import numpy as np
>>> import pandas as pd
>>> a = np.zeros((4,3,3),dtype=int) + [0,1,2]
>>> a *= 10
>>> a += np.array([1,2,3,4])[:,None,None]
>>> a
array([[[ 1, 11, 21],
        [ 1, 11, 21],
        [ 1, 11, 21]],

       [[ 2, 12, 22],
        [ 2, 12, 22],
        [ 2, 12, 22]],

       [[ 3, 13, 23],
        [ 3, 13, 23],
        [ 3, 13, 23]],

       [[ 4, 14, 24],
        [ 4, 14, 24],
        [ 4, 14, 24]]])

Split evenly along the last dimension; stack those elements, reshape, feed to DataFrame. Using the lengths of the array's dimensions simplifies the process.

>>> d0,d1,d2 = a.shape
>>> pd.DataFrame(np.stack(np.dsplit(a,d2)).reshape(d0*d2,d1))
     0   1   2
0    1   1   1
1    2   2   2
2    3   3   3
3    4   4   4
4   11  11  11
5   12  12  12
6   13  13  13
7   14  14  14
8   21  21  21
9   22  22  22
10  23  23  23
11  24  24  24
>>>

Using your shape.

>>> b = np.random.random((1536, 16, 48))
>>> d0,d1,d2 = b.shape
>>> df = pd.DataFrame(np.stack(np.dsplit(b,d2)).reshape(d0*d2,d1))
>>> df.shape
(73728, 16)
>>>
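
As a cross-check, the split/stack/reshape step appears to be equivalent to moving the last axis to the front and reshaping, which avoids building the intermediate list of 48 sub-arrays. This is a sketch of that alternative, not part of the answer as originally written:

```python
import numpy as np

b = np.random.random((1536, 16, 48))  # (time, electrode, sample)
d0, d1, d2 = b.shape

# dsplit/stack version, as used above
via_split = np.stack(np.dsplit(b, d2)).reshape(d0 * d2, d1)

# Move the sample axis to the front, then merge sample and time into rows.
via_transpose = b.transpose(2, 0, 1).reshape(d0 * d2, d1)

same = np.array_equal(via_split, via_transpose)
```

In both versions, row `s * 1536 + t` holds time step `t` of sample `s`, so each sample occupies one contiguous block of 1536 rows.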

After making the DataFrame from the 3D array, add the classification column to it with df['class'] = data, where data holds one class value per row (73728 values here). See "Column selection, addition, deletion" in the pandas documentation.
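
Since the question only has 48 class values, one per sample, they first need to be expanded to one value per row. With the row order produced above (each sample occupies a contiguous block of 1536 rows), `np.repeat` does this. A minimal sketch, assuming a hypothetical label vector `labels` of length 48:

```python
import numpy as np
import pandas as pd

b = np.random.random((1536, 16, 48))  # (time, electrode, sample)
d0, d1, d2 = b.shape
labels = np.arange(d2)                 # hypothetical class labels, one per sample

df = pd.DataFrame(np.stack(np.dsplit(b, d2)).reshape(d0 * d2, d1))

# Each label is repeated d0 times so it covers every row of its sample.
df["class"] = np.repeat(labels, d0)
```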

wwii