Numpy:

import numpy as np
nparr = np.array([[1, 5],[2,6], [3, 7]])
print(nparr)
print(nparr[0])    #first choose the row 
print(nparr[0][1]) #second choose the column

gives the output as expected:

[[1 5]
 [2 6]
 [3 7]]

[1 5]

5

Pandas:

import pandas as pd
df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [5, 6, 7]
})
print(df)
print(df['a'])  #first choose the column !!!
print(df['a'][1])  #second choose the row !!!

gives the following output:

   a  b
0  1  5
1  2  6
2  3  7

0    1
1    2
2    3
Name: a, dtype: int64

2

What is the fundamental reason for changing the default ordering of "indexes" in Pandas dataframe to be column first? What is the benefit we get for this loss of consistency/intuitiveness?

Of course, if I use the iloc indexer, I can write it much like NumPy array indexing:

print(df)
print(df.iloc[0])     # first choose the row
print(df.iloc[0][1])  # second choose the column
   a  b
0  1  5
1  2  6
2  3  7

a    1
b    5
Name: 0, dtype: int64

5
FatihAkici
2020
  • I think of a DataFrame as composed of Series. The series/columns can differ in `dtype`. `numpy` has structured arrays, with fields that have their own dtypes. – hpaulj Dec 30 '19 at 04:04

2 Answers

Because Numpy's intuition is mathematics (more specifically matrices, akin to MATLAB), while Pandas's is databases (akin to SQL). Numpy goes by rows and columns (rows first, because an element (i, j) of a matrix denotes the ith row and jth column), while Pandas works based on the columns of a database, inside which you choose elements, i.e. rows. Of course you can work directly on indices by using iloc, as you mentioned.

Hope the difference in paradigms/philosophies of the two makes sense.
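A minimal sketch of the two paradigms side by side. The float column `b` is an assumption added here (the question used integers) so that the per-column dtypes are visible.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3],        # integer column
    'b': [5.5, 6.5, 7.5],  # float column, changed from the question for illustration
})

# Column first: each column is a Series with a single dtype,
# much like selecting a field from a database table.
col_a = df['a']
print(col_a.dtype)
print(df['b'].dtype)

# A row cuts across columns, so its values are upcast to a common dtype.
row0 = df.iloc[0]
print(row0.dtype)

# Row-first, matrix-style access is still available through the indexers.
by_position = df.iloc[0, 1]   # row position 0, column position 1
by_label = df.loc[0, 'b']     # row label 0, column label 'b'
print(by_position, by_label)  # 5.5 5.5
```

The column keeps its own dtype while the row does not, which is one way to see why the column is the natural first unit of selection in a table-oriented structure.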

FatihAkici

numpy indexing is multidimensional. pandas is table oriented, just 2d (except for a multi-index variation).

In [42]: nparr = np.array([[1, 5],[2,6], [3, 7]])                               
In [43]: nparr                                                                  
Out[43]: 
array([[1, 5],
       [2, 6],
       [3, 7]])
In [44]: nparr[0]             # select a row                                                               
Out[44]: array([1, 5])
In [45]: nparr[:,0]           # select a column                                    
Out[45]: array([1, 2, 3])
In [46]: nparr[:,[0]]         # also a column, but keep 2d                                                  
Out[46]: 
array([[1],
       [2],
       [3]])
In [47]: nparr[:2,[1,0]]      # more general - 2 rows, 2 columns (reordered)                                                  
Out[47]: 
array([[5, 1],
       [6, 2]])

Your nparr[0][1] is more idiomatically written as nparr[0,1].
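The difference is more than style. As a small sketch: with scalar indices the two spellings agree, but as soon as the first index is a slice, chained indexing applies the second index to the rows again rather than to the columns.

```python
import numpy as np

nparr = np.array([[1, 5], [2, 6], [3, 7]])

# With scalar indices both spellings select the same element.
assert nparr[0][1] == nparr[0, 1] == 5

# With a slice they diverge: chaining indexes the result of the
# first operation, which is still a 2d array of rows.
chained = nparr[:2][1]   # second row of the first two rows
joint = nparr[:2, 1]     # column 1 of the first two rows
print(chained)  # [2 6]
print(joint)    # [5 6]
```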

This indexing generalizes to 3d (and higher):

In [48]: arr = np.arange(24).reshape(2,3,4)                                     
In [49]: arr                                                                    
Out[49]: 
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])
In [50]: arr[1,1,:]                                                             
Out[50]: array([16, 17, 18, 19])

It also generalizes to 1d (which will be like indexing a list), and even 0d.
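For illustration, a short sketch of the 1d and 0d cases (the names `arr1` and `arr0` are made up for this example):

```python
import numpy as np

arr1 = np.arange(5)   # 1d: one index, just like a list
print(arr1[2])        # 2
print(arr1[1:4])      # [1 2 3]

arr0 = np.array(7)    # 0d: the shape is an empty tuple ...
print(arr0.shape)     # ()
print(arr0[()])       # 7  ... and so is the index
```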

If I make a dataframe from this array, the data or values of the frame are the array itself:

In [52]: df = pd.DataFrame(nparr)                                               
In [53]: df                                                                     
Out[53]: 
   0  1
0  1  5
1  2  6
2  3  7
In [54]: df._values                                                             
Out[54]: 
array([[1, 5],
       [2, 6],
       [3, 7]])

If I modify an element of the array, we see the change in the frame as well (at least in the pandas version used here, where the frame wraps the array without copying it):

In [56]: nparr[0,1] *=100                                                       
In [57]: nparr                                                                  
Out[57]: 
array([[  1, 500],
       [  2,   6],
       [  3,   7]])
In [58]: df                                                                     
Out[58]: 
   0    1
0  1  500
1  2    6
2  3    7

In [61]: df[1]          # a Series                                                        
Out[61]: 
0    500
1      6
2      7
Name: 1, dtype: int64

pandas has added its own layer of indexing (including column and row labels) on top of the underlying array. One way or another, it maps its indexing inputs onto the array's.

Since there are other ways of constructing a dataframe, there isn't always a one-to-one match between a frame and an array.
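As one example of such a mismatch, here is a sketch of a frame built from a dict with columns of different dtypes (`df_mixed` is a name invented for this example). No single array backs it, so converting it to one forces a common dtype:

```python
import numpy as np
import pandas as pd

# Built from a dict: each column keeps its own dtype.
df_mixed = pd.DataFrame({'a': [1, 2, 3], 'b': [0.5, 1.5, 2.5]})
print(df_mixed.dtypes)

# Converting to a single array has to pick one dtype for everything,
# so the integer column is upcast to float.
values = df_mixed.to_numpy()
print(values.dtype)  # float64
print(values)
```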

hpaulj
  • While I very much appreciate the information in your answer, I am unable to see how it addresses my question. Maybe I am failing to find the line where the answer is. Is the answer in this statement: _Since there are other ways of constructing a dataframe, there isn't always one to one match between a frame and an array._? – 2020 Dec 30 '19 at 13:10
  • My main intent was to show that the `numpy` row/column order is just an instance of its general multidimensional indexing. `pandas` designers, for whatever reason, chose a Series/column orientation, which they use even when the underlying data structure is a numpy array. Most of us can only answer **why** questions by discerning patterns, not by reading the intent of the designers. – hpaulj Dec 30 '19 at 17:39
  • Thanks. Your answer [here](https://stackoverflow.com/a/38595905/1733060) also helped me to get a better understanding, especially the part where you show the numpy array indexed using the field name first, similar to pandas, giving a columnar output similar to pandas series. – 2020 Dec 31 '19 at 14:28
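
The structured-array behaviour mentioned in the comments can be sketched as follows. The field names `a` and `b` and their dtypes are assumptions chosen to mirror the question's dataframe; they are not taken from the linked answer.

```python
import numpy as np

# A 1d structured array: every element is a record with named fields.
sarr = np.array(
    [(1, 5.0), (2, 6.0), (3, 7.0)],
    dtype=[('a', 'i8'), ('b', 'f8')],
)

col = sarr['a']       # field first, like df['a'] in pandas
print(col)            # [1 2 3]
print(sarr['a'][1])   # 2  (field, then record)
print(sarr[1]['a'])   # 2  (record, then field)
```

Here NumPy itself offers name-first, column-like access, with each field carrying its own dtype, which is close to how a DataFrame treats its columns.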