Averages of Subsets of Python Dataframe

Question

I am working with the sklearn digits dataset.

Each datapoint is an 8x8 image of a digit.

[[0,1,2,3, .... 62,63], # This row is one image
 [0,1,2,3, .... 62,63], # 0-7 make up the first row of the image
 ... 1794 more times
 [0,1,2,3, .... 62,63]]

I set up my dataframe as follows:

import pandas as pd
from sklearn import datasets
digits = datasets.load_digits()
df = pd.DataFrame(data = digits.data)
df['target'] = digits.target

I am trying to iterate over each image and calculate averages over subsets of rows and columns.

To iterate over each image I just do the following: df[[i for i in range(64)]]

Or if I want a random subset of 8 pixels I do the following: df[random.sample(range(0, 64), 8)]

Those I can wrap my head around. I am struggling with trying to iterate over subsets of each image. How would I iterate over every row of each image individually?

I can select the first row of the first image like this: df.iloc[:1,0:8]

While this will select the first column of the first image (every 8th pixel of its row in the dataframe): df.iloc[:1,0:64:8]
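For reference, a minimal sketch of how the flat pixel index maps onto image rows and columns, assuming the 64 values of each image are stored row by row as in the layout above. The synthetic frame `demo` and the names `first_row` and `first_col` are illustrative assumptions, not part of the original setup:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the digits frame: 3 "images" of 64 pixels each,
# where pixel k of every image simply holds the value k.
demo = pd.DataFrame(np.tile(np.arange(64), (3, 1)))

# Row r of an image occupies flat positions 8*r .. 8*r + 7.
first_row = demo.iloc[0, 0:8].to_numpy()

# Column c of an image occupies flat positions c, c + 8, c + 16, ...
first_col = demo.iloc[0, 0:64:8].to_numpy()

print(first_row)
print(first_col)
```

The first index of `iloc` picks the image (a dataframe row) and the second picks pixels within it, so slicing the first index never moves around inside a single image.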

Ideally, I would like to output this structure:

[[image_1_col_1_avg..... col8_avg, row1_avg ..... row8_avg],
 [image_2_col_1_avg..... col8_avg, row1_avg ..... row8_avg],
   ....
 [image_1797_col_1_avg..... col8_avg, row1_avg ..... row8_avg]]

Where I shrink the 8*8 grid from 0-63 into the averages for each row and column. So instead of having 64 data points for each image, I would only have 16.

I have searched for a while but I can't find much documentation or guide on how to iterate through subsets of a dataframe. Of what I have found I can't really understand it. Any insight, guidance, or explanation of how to iterate over subsets of a dataframe will be much appreciated.

score 2 · Answer 1 · answered Feb 11 '18 at 10:48

You can use numpy: reshape to a 3D array, take the means along axes 1 and 2, join both arrays with numpy.hstack, and finally call the DataFrame constructor:

import numpy as np
import pandas as pd
from sklearn import datasets
digits = datasets.load_digits()
df = pd.DataFrame(data = digits.data)

col_ind = ['col_av_{}'.format(i) for i in range(1, 9)]
row_ind = ['row_av_{}'.format(i) for i in range(1, 9)]

a = df.values
b = a.reshape((a.shape[0], -1, 8))
c = np.hstack((b.mean(axis=1), b.mean(axis=2)))

df = pd.DataFrame(c, columns = col_ind + row_ind)
print (df.head())
   col_av_1  col_av_2  col_av_3  col_av_4  col_av_5  col_av_6  col_av_7  \
0       0.0     2.250    10.500     6.000     5.000     8.500     4.500   
1       0.0     0.875     2.625    14.125    15.625     5.875     0.000   
2       0.0     1.625     6.125    10.875    12.500    10.125     1.750   
3       0.0     1.250     4.750     8.375    10.375     6.375     2.250   
4       0.0     1.125     4.875     8.375     8.625     7.125     2.125   

   col_av_8  row_av_1  row_av_2  row_av_3  row_av_4  row_av_5  row_av_6  \
0       0.0     3.500     7.250     4.875     4.000     3.750     4.375   
1       0.0     3.750     4.500     5.000     7.000     4.500     4.875   
2       0.0     3.875     6.000     5.625     4.125     4.750     5.750   
3       0.0     4.500     5.750     3.625     3.625     3.250     2.375   
4       0.0     1.500     1.875     3.000     4.875     6.625     8.125   

   row_av_7  row_av_8  
0     5.375     3.625  
1     4.875     4.625  
2     8.000     4.875  
3     5.000     5.250  
4     3.500     2.750
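To see why axis 1 of the reshaped array gives the column averages and axis 2 the row averages, here is a small sketch on synthetic data. The array `flat` and the names `cube`, `col_means`, and `row_means` are assumptions made for illustration only:

```python
import numpy as np

# Two synthetic "images" flattened to 64 values each (hypothetical data):
# the first image holds 0..63, the second 64..127.
flat = np.arange(128, dtype=float).reshape(2, 64)

# Shape (n_images, 8 rows, 8 columns), the same reshape as above.
cube = flat.reshape(flat.shape[0], -1, 8)

# Averaging over axis 1 collapses the rows, leaving one value per column.
col_means = cube.mean(axis=1)
# Averaging over axis 2 collapses the columns, leaving one value per row.
row_means = cube.mean(axis=2)

features = np.hstack((col_means, row_means))
print(features.shape)
```

In the first synthetic image, column 1 holds 0, 8, ..., 56 and row 1 holds 0, 1, ..., 7, which makes the two results easy to check by hand.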

Good approach, but using a dataframe is not necessary for this problem; only numpy arrays are needed — Espoir Murhabazi, Feb 11 '18 at 18:05

sgDysregulation · Answer 2 · 2018-02-11T09:27:56.440

In pandas you very rarely need to use loops. You can usually reduce the problem to a function applied to every row, i.e. to each image. The following line does just that: it iterates through the rows of the dataframe df and applies the function func, which reshapes each row into an image.

# select the image part of df and apply the function
df_res = df[list(range(64))].apply(func, axis=1)

Now the problem becomes smaller: given a 1D image, return the required averages.

def func(img):
    # the input img is a series with length 64
    # convert to numpy array and reshape the image
    img = img.values.reshape(8, 8)
    # create the list of col_avg, row_avg to use in the result
    col_ind = ['col_av_{}'.format(i) for i in range(1, 9)]
    row_ind = ['row_av_{}'.format(i) for i in range(1, 9)]

    res = pd.Series(index=col_ind + row_ind, dtype=float)
    # calculate the col average and assign it to the col_index in res
    res[col_ind] = img.mean(axis=0)
    # calculate the row average and assign it to the row_index in res
    res[row_ind] = img.mean(axis=1)
    return res

Running the line above after defining the function produces the desired result. A sample of the output is shown below:

In [44]: df_r = df[range(64)].apply(func,axis=1)

In [45]: df_r.head()
Out[45]: 
   col_av_1  col_av_2  col_av_3  col_av_4  col_av_5  col_av_6  col_av_7  \
0       0.0     2.250    10.500     6.000     5.000     8.500     4.500   
1       0.0     0.875     2.625    14.125    15.625     5.875     0.000   
2       0.0     1.625     6.125    10.875    12.500    10.125     1.750   
3       0.0     1.250     4.750     8.375    10.375     6.375     2.250   
4       0.0     1.125     4.875     8.375     8.625     7.125     2.125   

   col_av_8  row_av_1  row_av_2  row_av_3  row_av_4  row_av_5  row_av_6  \
0       0.0     3.500     7.250     4.875     4.000     3.750     4.375   
1       0.0     3.750     4.500     5.000     7.000     4.500     4.875   
2       0.0     3.875     6.000     5.625     4.125     4.750     5.750   
3       0.0     4.500     5.750     3.625     3.625     3.250     2.375   
4       0.0     1.500     1.875     3.000     4.875     6.625     8.125   

   row_av_7  row_av_8  
0     5.375     3.625  
1     4.875     4.625  
2     8.000     4.875  
3     5.000     5.250  
4     3.500     2.750
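A self-contained sketch of the same apply-based idea, run on a small synthetic frame so it does not depend on the digits data. The names `summarize`, `labels`, and `toy` are hypothetical, and the function builds its result Series in one step rather than filling an empty one:

```python
import numpy as np
import pandas as pd

labels = (['col_av_{}'.format(i) for i in range(1, 9)]
          + ['row_av_{}'.format(i) for i in range(1, 9)])

def summarize(img):
    # img is one row of the frame: a Series holding 64 pixel values
    grid = img.to_numpy(dtype=float).reshape(8, 8)
    values = np.concatenate((grid.mean(axis=0), grid.mean(axis=1)))
    return pd.Series(values, index=labels)

# Hypothetical stand-in for df: 5 random "images" plus a target column
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.integers(0, 17, size=(5, 64)))
toy['target'] = 0  # must be excluded before reshaping to 8x8

result = toy[list(range(64))].apply(summarize, axis=1)
print(result.shape)
```

Selecting the 64 pixel columns first matters: with the target column included, each row would have 65 values and the reshape to 8x8 would fail.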

Edit: Alternatively, use pandas groupby with modulus 8 to group the columns of the image and integer division by 8 to group the rows:

# create an empty dataframe
df_re = pd.DataFrame()
# create col and row index names
col_ind = ['col_av_{}'.format(i) for i in range(1, 9)]
row_ind = ['row_av_{}'.format(i) for i in range(1, 9)]
# pixels with the same remainder mod 8 share an image column
df_re[col_ind] = df[list(range(64))].groupby(lambda x: x % 8, axis=1).mean()
# pixels with the same quotient by 8 share an image row
df_re[row_ind] = df[list(range(64))].groupby(lambda x: x // 8, axis=1).mean()
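Recent pandas releases deprecate the axis=1 argument of groupby (since 2.1, as far as I know), so a variant that transposes first may be safer on newer versions. This is only a sketch under that assumption; the random frame `pixels` and the names `col_avg` and `row_avg` are illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 64 pixel columns of df
rng = np.random.default_rng(0)
pixels = pd.DataFrame(rng.integers(0, 17, size=(4, 64)))

col_ind = ['col_av_{}'.format(i) for i in range(1, 9)]
row_ind = ['row_av_{}'.format(i) for i in range(1, 9)]

# Transpose so the pixel labels 0..63 become the index, group them,
# then transpose back to one row per image.
col_avg = pixels.T.groupby(lambda k: k % 8).mean().T
row_avg = pixels.T.groupby(lambda k: k // 8).mean().T

col_avg.columns = col_ind
row_avg.columns = row_ind
df_re = pd.concat([col_avg, row_avg], axis=1)
print(df_re.shape)
```

The grouping keys are the same as in the edit above; only the axis handling differs.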

score 1 · Accepted Answer · edited Feb 12 '18 at 06:38

1st APPROACH

My approach uses numpy arrays and functions:

Reshape the data to a 3D array:

data = digits.data.reshape(1797, 8, 8)

Define a function that takes one matrix from the 3D array and returns its column averages followed by its row averages:

def a_function(x):
    row_average = np.apply_along_axis(np.average, 1, x)
    columns_average = np.apply_along_axis(np.average, 0, x)
    return np.append(columns_average, row_average)

Apply that function to each matrix in the 3D array (there may be a faster way to do it using only numpy):

maped = list(map(a_function, data))

And create the final dataframe:

pd.DataFrame(maped)
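On the remark that a faster numpy-only route may exist: one candidate is to take the means directly on the 3D array, with no per-image function call. The sketch below checks that candidate against the per-image function on synthetic data; the array `data` here is a random stand-in, not the digits set:

```python
import numpy as np

# Hypothetical stand-in for the reshaped digits array: 6 images of 8x8
rng = np.random.default_rng(0)
data = rng.integers(0, 17, size=(6, 8, 8)).astype(float)

def a_function(x):
    # same per-image function as in the 1st approach
    row_average = np.apply_along_axis(np.average, 1, x)
    columns_average = np.apply_along_axis(np.average, 0, x)
    return np.append(columns_average, row_average)

# Per-image loop, as in the 1st approach
looped = np.array([a_function(m) for m in data])

# Whole-array version: axis 1 collapses image rows (column averages),
# axis 2 collapses image columns (row averages)
vectorized = np.concatenate((data.mean(axis=1), data.mean(axis=2)), axis=1)

print(np.allclose(looped, vectorized))
```

Both produce one 16-value row per image, so either result can be passed to pd.DataFrame in the same way.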

2nd APPROACH

This is better than the first: you only need numpy and its apply_along_axis function applied to your data:

import numpy as np
from sklearn import datasets
digits = datasets.load_digits()
data = digits.data
def a_function(x):
    x = x.reshape(8, 8)
    row_average = np.apply_along_axis(np.average, 1, x)
    columns_average = np.apply_along_axis(np.average, 0, x)
    return np.append(columns_average, row_average)

The above function is applied to each row of your dataset like this:

final_data = np.apply_along_axis(a_function, 1, data)

final_data is a 1797 x 16 array that you can use in any classifier. This is what you need; it's not necessary to use a dataframe. The array looks like this:

array([[  0.   ,   2.25 ,  10.5  , ...,   4.375,   5.375,   3.625],
       [  0.   ,   0.875,   2.625, ...,   4.875,   4.875,   4.625],
       [  0.   ,   1.625,   6.125, ...,   5.75 ,   8.   ,   4.875],
       ..., 
       [  0.   ,   0.   ,  10.   , ...,   7.625,   7.625,   3.75 ],
       [  0.   ,   1.125,   7.75 , ...,   2.25 ,   4.5  ,   5.625],
       [  0.   ,   1.875,  12.25 , ...,   6.5  ,   8.25 ,   6.   ]])

PS: Using numpy functions for the averages is better than using Python's built-in functions, because numpy is implemented in C and runs faster when numpy functions operate on numpy arrays than when Python built-ins are mixed with numpy arrays. For more check this
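A rough way to check this claim on your own machine is to time both styles with timeit. The sketch below uses random synthetic images and hypothetical helper names (`builtin_row_means`, `numpy_row_means`); timings depend on the machine and array size, so no particular numbers are implied:

```python
import timeit
import numpy as np

# Hypothetical stand-in data: 500 random 8x8 images
rng = np.random.default_rng(0)
images = rng.integers(0, 17, size=(500, 8, 8)).astype(float)

def builtin_row_means(imgs):
    # Python-level loops with the built-in sum() and len()
    return [[sum(row) / len(row) for row in img] for img in imgs]

def numpy_row_means(imgs):
    # a single numpy call over the whole array
    return imgs.mean(axis=2)

t_builtin = timeit.timeit(lambda: builtin_row_means(images), number=3)
t_numpy = timeit.timeit(lambda: numpy_row_means(images), number=3)
print('built-in: {:.4f}s  numpy: {:.4f}s'.format(t_builtin, t_numpy))
```

Both functions return the same row averages, so the comparison isolates the cost of looping in Python versus inside numpy.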

Thank you for the detailed explanation! — danielsmith1789, Feb 15 '18 at 07:11
