
I found this prior post, which gets me close: how-to-convert-a-pandas-dataframe-subset-of-columns-and-rows-into-a-numpy-array

But instead of making a single array (or matrix) of two columns based on the value in a third, I need to iterate through the data frame and create a 3x3 array (or matrix) from columns 'b' through 'j' for each correctly matching value in 'a'.

    dft = pd.DataFrame({'a' : ['NW', 'NW', 'SL', 'T'],
                        'b' : [1, 2, 3, 4],
                        'c' : [5, 6, 7, 8],
                        'd' : [11, 12, 13, 14],
                        'e' : [9, 10, 11, 12],
                        'f' : [4, 3, 2, 1],
                        'g' : [15, 14, 13, 12],
                        'h' : [13, 14, 15, 16],
                        'i' : [5, 4, 3, 2],
                        'j' : [9, 8, 7, 6]})

    print(dft)
         a  b   c   d   e   f   g   h   i   j
     0  NW  1   5   11  9   4   15  13  5   9
     1  NW  2   6   12  10  3   14  14  4   8
     2  SL  3   7   13  11  2   13  15  3   7
     3  T   4   8   14  12  1   12  16  2   6

What I want is 2 separate arrays, one for each 'NW' row:

     [[ 1  5 11]
      [ 9  4 15]
      [13  5  9]]

     [[ 2  6 12]
      [10  3 14]
      [14  4  8]]

I have tried the following, based on the original post, and received a really ugly error.

    dft.loc[dft['a'] == 'NW',['b', 'c', 'd'], ['e', 'f', 'g'], ['h', 'i', 'j']].values

Here is the error:

    ---------------------------------------------------------------------------
    IndexingError                             Traceback (most recent call last)
    <ipython-input> in <module>()
    ----> 1 dft.loc[dft['a'] == 'NW',['b', 'c', 'd'], ['e', 'f', 'g'], ['h', 'i', 'j']].values

    D:\Applications\Anaconda\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
       1323             except (KeyError, IndexError):
       1324                 pass
    -> 1325             return self._getitem_tuple(key)
       1326         else:
       1327             key = com._apply_if_callable(key, self.obj)

    D:\Applications\Anaconda\lib\site-packages\pandas\core\indexing.py in _getitem_tuple(self, tup)
        839
        840         # no multi-index, so validate all of the indexers
    --> 841         self._has_valid_tuple(tup)
        842
        843         # ugly hack for GH #836

    D:\Applications\Anaconda\lib\site-packages\pandas\core\indexing.py in _has_valid_tuple(self, key)
        186         for i, k in enumerate(key):
        187             if i >= self.obj.ndim:
    --> 188                 raise IndexingError('Too many indexers')
        189             if not self._has_valid_type(k, i):
        190                 raise ValueError("Location based indexing can only have [%s] "

    IndexingError: Too many indexers

Thoughts? I am so close, yet tantalizingly far.

  • Also, I have no clue how to format the error output here, so any help cleaning that up would be appreciated.
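For reference, the traceback above arises because `.loc` accepts at most one indexer per axis (rows, then columns), so the three separate column lists form a 4-tuple of indexers. A minimal sketch of the intended selection, using a single column list and then reshaping:

```python
import pandas as pd

dft = pd.DataFrame({'a': ['NW', 'NW', 'SL', 'T'],
                    'b': [1, 2, 3, 4], 'c': [5, 6, 7, 8], 'd': [11, 12, 13, 14],
                    'e': [9, 10, 11, 12], 'f': [4, 3, 2, 1], 'g': [15, 14, 13, 12],
                    'h': [13, 14, 15, 16], 'i': [5, 4, 3, 2], 'j': [9, 8, 7, 6]})

# One indexer per axis: a boolean mask for rows, one list for all nine columns.
sub = dft.loc[dft['a'] == 'NW', ['b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']]
arrays = sub.values.reshape(-1, 3, 3)   # one 3x3 block per matching row
```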

2 Answers


You can do this without a loop:

a = dft.loc[dft['a'] == 'NW', 'b':'j']   # rows matching 'NW', columns b through j
n = a.shape[0]                           # number of matching rows
new_a = a.values.reshape(n, 3, 3)

You get

array([[[ 1,  5, 11],
        [ 9,  4, 15],
        [13,  5,  9]],

       [[ 2,  6, 12],
        [10,  3, 14],
        [14,  4,  8]]])
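If arrays are needed for every key in 'a' rather than just 'NW' (the comments below note the real data has an unknown number of matching values), a hedged `groupby` variant of the same reshape builds one stacked array per key; this assumes columns 'b' through 'j' are contiguous, as in the sample:

```python
import pandas as pd

dft = pd.DataFrame({'a': ['NW', 'NW', 'SL', 'T'],
                    'b': [1, 2, 3, 4], 'c': [5, 6, 7, 8], 'd': [11, 12, 13, 14],
                    'e': [9, 10, 11, 12], 'f': [4, 3, 2, 1], 'g': [15, 14, 13, 12],
                    'h': [13, 14, 15, 16], 'i': [5, 4, 3, 2], 'j': [9, 8, 7, 6]})

# one (n, 3, 3) array per key in column 'a'; -1 lets numpy infer the row count
arrays = {key: grp.loc[:, 'b':'j'].values.reshape(-1, 3, 3)
          for key, grp in dft.groupby('a')}
```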
Vaishali
  • I like this solution, but it assumes that there are only 2 rows with 'NW' in the 'a' column. Do you know if there's a way to do this in one line without having to hard-code the first value (2) in `reshape()`? – Oriol Mirosa Jan 02 '18 at 01:58
  • @OriolMirosa, you are right. See the edit to my solution – Vaishali Jan 02 '18 at 02:01
  • Nice! I guess you could even delete the second line and have the third be simply `new_a = a.values.reshape(a.shape[0], 3, 3)`, right? – Oriol Mirosa Jan 02 '18 at 02:13
  • @OriolMirosa, yes. I just created a variable n for clarity – Vaishali Jan 02 '18 at 02:16
  • This needs to iterate over an Excel document of unknown length with unknown number of values to match against. I made a very simple sample of the data. –  Jan 02 '18 at 02:22
  • The long of this is I need to take Excel row values, extract to an array / matrix, perform matrix multiplication on it. This is to convert electrical three phase per-phase values to sequence values to extract values for further calculations. –  Jan 02 '18 at 02:24
  • You have tagged pandas so I am assuming that you will load the excel into pandas dataframe first. If so, this solution will work just fine – Vaishali Jan 02 '18 at 02:25
  • I tried the above solution -> works! Thanks all. Can you please explain it? I understand the `loc` call, but why two variables and how is `n` helping `a` to create the arrays? –  Jan 02 '18 at 02:47
  • a is the slice of the dataframe with only the rows that have 'NW' in column a. Find the number of rows in a using a.shape[0] and save it in variable n. Now create an array from a using a.values and reshape it with (n = number of rows with 'NW', 3 rows, 3 columns). – Vaishali Jan 02 '18 at 03:00
    And here I was thinking it would be hard ;) Thanks to this board! –  Jan 02 '18 at 03:06

I'm not 100% sure what you're after, but maybe this will help:

new_arrays = []

for index, row in dft.iterrows():
    if row['a'] == 'NW':
        new_arrays.append(row[1:].values.reshape(3, 3))

With itertuples(), as requested in the comments (each tuple starts with the index, so column 'a' is row[1]; this assumes numpy is imported as np):

for row in dft.itertuples():
    if row[1] == 'NW':
        new_arrays.append(np.array(row[2:]).reshape(3, 3))

Now you have each of the two arrays in new_arrays, and you can print them together or access individually:

new_arrays[0]

array([[1, 5, 11],
       [9, 4, 15],
       [13, 5, 9]], dtype=object)

new_arrays[1]

array([[2, 6, 12],
       [10, 3, 14],
       [14, 4, 8]], dtype=object)
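One caveat: because `iterrows()` hands each mixed-dtype row back as an object Series, these arrays come out as `dtype=object`; for the matrix multiplication mentioned in the question, a numeric cast is presumably wanted. A minimal sketch:

```python
import numpy as np

# one of the object-dtype arrays produced above
arr = np.array([[1, 5, 11], [9, 4, 15], [13, 5, 9]], dtype=object)
num = arr.astype(np.int64)   # numeric copy, safe to use in matrix math
```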
Oriol Mirosa
  • That worked with my sample data perfectly, but when trying it with my real data, where I simply did `if row['Code'] == 'Network'`, it tells me `'Code' is not defined`. What am I missing? The rest of the data is exactly the same, just more columns in the raw data. Also, I read in the pandas documentation that `iterrows` is bad and to use `itertuples`. I tried that and it said `ValueError: too many values to unpack (expected 2)`. –  Jan 02 '18 at 01:41
  • It looks like your data frame doesn't have a column called `Code`. Is that the case? As for `itertuples()`, it makes things a little more awkward, but I added the solution to the answer. – Oriol Mirosa Jan 02 '18 at 01:53
  • Doing `dft.columns` returns `Code\n`. I added the `\n` and it worked. Guess it's because the column is text? –  Jan 02 '18 at 02:43
  • The `\n` is a newline character; you might want to clean it out of `dft.columns` to avoid problems in the future. I'm glad it works now. – Oriol Mirosa Jan 02 '18 at 02:54
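A minimal sketch of the column cleanup suggested in that last comment, assuming the real sheet's headers carry stray whitespace or newlines like `'Code\n'` (the names here are placeholders, not the real columns):

```python
import pandas as pd

df = pd.DataFrame({'Code\n': ['Network'], ' b ': [1]})
df.columns = df.columns.str.strip()   # drop surrounding whitespace/newlines
```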