Pull Column from DataFrame and Calculate the Standard Deviation for Each Column in Each Cluster

Question

I realize how confusing the title sounds, so let me explain my issue. I have a DataFrame that is split by the Cluster ID column, which identifies the cluster. Each cluster DataFrame has the same column labels.

I'm trying to write a function that takes a cluster DataFrame and returns the standard deviation of each column (14 columns in total). An image of cluster_0 is below:

Cluster 0 DataFrame

Of course, I could go through and list out each column for each cluster, but that's time-consuming and not very efficient. If you could check my code and let me know where and how it went wrong, I'd greatly appreciate it.

What I'm trying to achieve: the standard deviation of each column (A -> N) for each cluster DataFrame.

My Code:


cluster_joint_col_name = list(["X", "Y", "Cluster ID", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N"])

joint_table_df.columns = cluster_joint_col_name
cluster_0                   = joint_table_df[joint_table_df['Cluster ID'] == 0]
cluster_1                   = joint_table_df[joint_table_df['Cluster ID'] == 1]
cluster_2                   = joint_table_df[joint_table_df['Cluster ID'] == 2]


def standardDeviation(self):
    self = self[['A']].stack().std()
    self = self[['B']].stack().std()
    self = self[['C']].stack().std()
    self = self[['D']].stack().std()
    self = self[['E']].stack().std()
    self = self[['F']].stack().std()
    self = self[['G']].stack().std()
    self = self[['H']].stack().std()
    self = self[['I']].stack().std()
    self = self[['J']].stack().std()
    self = self[['K']].stack().std()
    self = self[['L']].stack().std()
    self = self[['M']].stack().std()
    self = self[['N']].stack().std()

    return self


cluster_j_0 = pd.DataFrame(standardDeviation(cluster_0))

My error:

Traceback (most recent call last):

line 345, in

cluster_j_0 = pd.DataFrame(standardDeviation(cluster_0))

line 328, in standardDeviation

self = self[['B']].stack().std()

IndexError: invalid index to scalar variable.

IIUC, maybe try `df.loc[:, 'A':].groupby(df['Cluster ID']).std()` ..? — Chris Adams, Feb 28 '20 at 15:45
You should consider what others will need in order to answer your question, and realize that we often need to reproduce (at least partially) what you have done. That is the reason why SO rules require that you show your code. When it comes to pandas, a **copyable** data sample is required too. You should read [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). Long story short: avoid linked or pasted *images* if possible — Serge Ballesta, Feb 28 '20 at 15:46
@SergeBallesta the code itself is over 300 lines to get to this point, so I figured an image was the best way to show how it looks. Also the only way to show the image was to link it as my reputation isn't high enough to do anything else — acm151130, Feb 28 '20 at 15:54
Instead of copying the image, you could have copied the same data as text and pasted it in the question. — Serge Ballesta, Feb 28 '20 at 16:03
So your next question will be nicer... Anyway Chris could help you here :-) — Serge Ballesta, Feb 28 '20 at 16:14

Chris Adams · Accepted Answer · 2020-02-28T15:57:12.023

IIUC, use loc to filter, then use groupby.std:

Example

import numpy as np
import pandas as pd

# Setup - Create toy data
np.random.seed(0)
df = pd.DataFrame({'X':np.random.randn(100), 'Y':np.random.randn(100), 'Cluster ID': np.random.choice([1, 2, 3, 4, 5], size=100)})
df = df.join(pd.DataFrame(np.random.randn(100, 13), columns=list('ABCDEFGHIJLMN'))).sort_values('Cluster ID')
print(df.head())

#            X         Y  Cluster ID         A         B         C         D  \
# 78 -0.311553 -0.455533           1  0.505387  0.359249 -1.582494  2.243602   
# 22  0.864436  0.298238           1 -0.261645 -0.182245 -0.202897 -0.109883   
# 21  0.653619 -1.099401           1 -0.888971  0.242118 -0.888720  0.936742   
# 93  0.976639 -1.168093           1 -1.067742  1.761266  0.754096 -0.625027   
# 42 -1.706270  0.166673           1  0.074586 -1.077099 -0.424663 -0.829965   

#            E         F         G         H         I         J         L  \
# 78 -1.422795  1.922325 -2.115056  1.405365  1.618054 -0.824409  0.422580   
# 22  0.213480 -1.208574 -0.242020  1.518261 -0.384645 -0.443836  1.078197   
# 21  1.412328 -2.369587  0.864052 -2.239604  0.401499  1.224871  0.064856   
# 93 -0.390393  0.112558 -0.655545  0.067517  0.777604 -0.035743  0.336016   
# 42  1.411172  0.785804 -0.057470 -0.391217  0.940918  0.405204  0.498052   

#            M         N  
# 78  0.547481 -0.813794  
# 22 -2.559185  1.181379  
# 21 -1.279689 -0.585431  
# 93  0.886492 -0.272132  
# 42 -0.026192 -1.688230  

df.loc[:, 'A':'N'].groupby(df['Cluster ID']).std()

[out]

                   A         B         C         D         E         F         G         H         I         J         L         M         N
Cluster ID                                                                                                                                  
1           1.075660  0.840725  1.159909  0.784231  1.008940  1.202760  0.917767  0.892579  1.122632  0.808477  0.733555  0.873824  0.966545
2           0.847821  0.962078  0.949521  1.237155  0.862592  1.033416  0.953247  0.901012  1.081884  1.037335  1.081790  1.094946  1.056148
3           0.557165  0.665750  0.829684  0.910348  0.983214  0.999066  1.006437  1.108337  0.930933  1.452574  0.993500  1.164469  0.875057
4           0.806774  1.035797  0.621237  0.802998  0.580498  0.670066  1.061085  1.067594  1.072780  0.914693  1.083995  0.573358  1.098594
5           0.811129  1.299342  0.872402  0.964955  0.911904  0.862126  0.949792  0.996712  1.099537  0.973381  0.833759  1.401223  1.034191
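
As a side note on the traceback in the question: the first line of `standardDeviation` rebinds `self` to a scalar, so the following `self[['B']]` tries to index a number rather than a DataFrame, which is what raises the `IndexError`. The sketch below shows one way a per-cluster function could look, next to the equivalent `groupby` call. The random `joint_table_df`, the helper name `standard_deviation`, and the `value_cols` list are stand-ins made up for illustration; they are not the asker's real data or code.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's joint_table_df (random values, 3 clusters).
rng = np.random.default_rng(0)
value_cols = list("ABCDEFGHIJKLMN")  # the 14 columns named in the question
joint_table_df = pd.DataFrame(rng.standard_normal((60, 14)), columns=value_cols)
joint_table_df.insert(0, "Cluster ID", np.repeat([0, 1, 2], 20))


def standard_deviation(cluster_df, columns=value_cols):
    """Return the sample standard deviation of each listed column as a Series."""
    # DataFrame.std() already works column by column, so the input is never
    # overwritten with a scalar the way `self` was in the original function.
    return cluster_df[columns].std()


# Per-cluster call, mirroring the approach in the question.
cluster_0 = joint_table_df[joint_table_df["Cluster ID"] == 0]
cluster_0_std = standard_deviation(cluster_0)

# All clusters at once: one row per cluster, one column per value column.
all_std = joint_table_df.groupby("Cluster ID")[value_cols].std()
```

Selecting the columns by name, rather than by the `'A':'N'` label slice, should give the same result here and does not depend on column order.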
