I have a dataframe, df, which contains a column called 'event' holding a 24x24x40 numpy array in each row. I want to:
- extract this numpy array;
- flatten it into a 1x23040 vector;
- add this vector as a row in a new numpy array or dataframe;
- perform PCA on the resulting matrix.
However, the PCA produces eigenvectors with the dimensions of 'the number of entries', not the 'number of dimensions in the data'.
To illustrate my problem, I demonstrate a minimal example that works perfectly well:
EXAMPLE 1
from sklearn import datasets, decomposition
digits = datasets.load_digits()
X = digits.data
pca = decomposition.PCA()
X_pca = pca.fit_transform(X)
print (X.shape)
Result: (1797, 64)
print (X_pca.shape)
Result: (1797, 64)
There are 1797 entries in each case, with eigenvectors of dimension 64.
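For reference, the eigenvectors themselves are stored in the fitted object's components_ attribute, not in the transformed array. A minimal check on the same digits data (a sketch, not part of my original script):

```python
from sklearn import datasets, decomposition

digits = datasets.load_digits()
X = digits.data

pca = decomposition.PCA()
pca.fit(X)

# Each row of components_ is one eigenvector (principal axis) in feature space.
print(pca.components_.shape)  # (64, 64)
```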
Now onto my example:
EXAMPLE 2
from sklearn import datasets, decomposition
import pandas as pd
hdf=pd.HDFStore('./afile.h5')
df=hdf.select('batch0')
print(df['event'][0].shape)
Result: (1, 24, 24, 40)
print(df['event'][0].flatten().shape)
Result: (23040,)
import numpy as np
_list = []  # collects one flattened event per dataframe row
for index, row in df.iterrows():
    entry = df['event'][index].flatten()
    _list.append(entry)
X = np.asarray(_list)
pca = decomposition.PCA()
X_pca=pca.fit_transform(X)
print (X.shape)
Result: (201, 23040)
print (X_pca.shape)
Result: (201, 201)
The second dimension is now the number of entries (201), not the number of dimensions in the data (23040)!
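The same behaviour can be reproduced without my HDF file. The sketch below uses a small synthetic matrix as a stand-in (the sizes 20 and 300 are arbitrary assumptions, chosen only so that there are fewer entries than features). With the default n_components, scikit-learn's PCA keeps min(n_samples, n_features) components, so the transformed array has one column per component, while each eigenvector in components_ still has the full feature length:

```python
import numpy as np
from sklearn import decomposition

# Synthetic stand-in for the real data: fewer samples than features.
rng = np.random.default_rng(0)
n_samples, n_features = 20, 300
X = rng.normal(size=(n_samples, n_features))

pca = decomposition.PCA()
X_pca = pca.fit_transform(X)

# Projected data: one row per sample, one column per retained component.
print(X_pca.shape)            # (20, 20)
# Eigenvectors: one row per component, each of length n_features.
print(pca.components_.shape)  # (20, 300)
```

If that carries over to Example 2, the (201, 201) array would be the projected data, and the eigenvectors of dimension 23040 would be in pca.components_.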
I am unfamiliar with dataframes, so it could be that I am iterating through the dataframe incorrectly. However, I have checked that the rows of the resulting numpy array X in Example 2 can be reshaped and plotted as expected.
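For comparison, here is a way to build the matrix without iterrows, as a sketch on a hypothetical small dataframe (the (1, 4, 4, 5) event shape and the 6 rows are made-up stand-ins for my real data, and it assumes every 'event' cell holds an array of the same shape):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: each 'event' cell holds a (1, 4, 4, 5) array.
rng = np.random.default_rng(0)
df = pd.DataFrame({"event": [rng.normal(size=(1, 4, 4, 5)) for _ in range(6)]})

# Flatten each event and stack the results as rows: (n_entries, n_values).
X = np.stack([event.ravel() for event in df["event"]])
print(X.shape)  # (6, 80)
```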
Any thoughts would be appreciated!
Kind regards!