I have a large dataset and I am computing the daily standard deviation of the residual for each ID. The code is correct; however, when I run it, it just keeps on running and running.

This is my data

[image of the data, not included]

This is my code:

The first two lines create a repeated value for each ID, stored in my dataframe so that the last three lines can easily compute the variance and std.

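# per-ID mean and count, repeated on every row that has that ID
C['mean'] = C.apply(lambda x: C[(C.ID == x.ID)].residual.mean(), axis=1)
C['size'] = C.apply(lambda x: C[(C.ID == x.ID)].residual.count(), axis=1)

# squared deviation, variance term and its square root
C['diff2'] = (C['residual'] - C['mean'])**2
C['var'] = C['diff2'] / (C['size'] - 1)
C['stddev'] = C['var'] ** 0.5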

My question is: how can I make this code more efficient?

khalilnait
  • You've provided 5 lines of code entirely without context. How are we supposed to explain why it's taking too long to run? Please see [mcve] and [ask]. An image of your data and a few lines of code with undeclared variable types and no information about where and how those lines are being executed isn't very useful. You're going to need to [edit] your post to provide more information. – Ken White May 05 '20 at 00:49
  • Learn about [python step by step debugging](https://stackoverflow.com/questions/4929251/how-to-step-through-python-code-to-help-debug-issues) so you'll know what your code is actually doing – Martheen May 05 '20 at 00:51
  • @KenWhite When I apply this code to small data, it works; however, with large data, it keeps running without any results. – khalilnait May 05 '20 at 01:34
  • The [site guidelines](http://stackoverflow.com/help/on-topic) require that you provide a [mre] that demonstrates the issue. As I previously said, 5 lines of out-of-context code do not satisfy that requirement. – Ken White May 05 '20 at 01:36

1 Answer

The problem is that you're repeatedly filtering the DataFrame searching for all records where the IDs match the current row. Furthermore, you're doing this twice: once for mean and once for size.

This is a situation where you should be using a groupby() on the ID and aggregating the residual.

If I understand your end goal to be computing the standard deviation for each ID, then try something like this:

# string aggregation names avoid the deprecation warning newer pandas raises for np.mean etc.
D = C.groupby("ID")["residual"].agg(["mean", "size", "var", "std"]).reset_index()

D should be a DataFrame with the computed statistics (may need to rename columns).
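
If the goal is to keep the per-ID statistics repeated on every row, as in the question, one possible sketch uses groupby().transform(), which returns a result aligned with the original rows instead of one row per group. The small frame below is made-up data for illustration; only the column names ID and residual come from the question.

```python
import pandas as pd

# Made-up sample data; only the column names come from the question.
C = pd.DataFrame({
    "ID": ["a", "a", "a", "b", "b"],
    "residual": [1.0, 2.0, 3.0, 10.0, 14.0],
})

grouped = C.groupby("ID")["residual"]

# transform() broadcasts each group's statistic back to that group's rows.
mean_per_row = grouped.transform("mean")
size_per_row = grouped.transform("count")
std_per_row = grouped.transform("std")  # sample standard deviation (ddof=1)

C["mean"] = mean_per_row
C["size"] = size_per_row
C["stddev"] = std_per_row

print(C)
```

Each ID is processed once here, rather than re-filtering the whole frame for every row.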

putnampp
  • Thank you for the feedback. Your code computes the mean for each group. My end goal is to compute the **daily** variance and std for each ID. The only problem I am facing is calculating the mean and size for each group, which your code does correctly. I just want to display the mean and size in my dataframe, even if they are **repeated**; this will make it easier to compute the last three lines of my code. Thank you so much. – khalilnait May 05 '20 at 01:14
  • Actually, when I run this on a small dataframe it works. With a large dataframe, it keeps running and running. – khalilnait May 05 '20 at 01:19
  • So, join/merge D and C together by ID. D computes the global statistics for each ID once; the merge adds the global statistic columns you're interested in. That said, I'm not sure what you mean by daily variance. If there is a date column in your data, just add it to the groupby(). Please expand your description to more accurately represent what it is you are attempting to accomplish. The daily aspect of your ask is completely ambiguous. – putnampp May 05 '20 at 01:27
  • Thank you so much, sir, you saved me. The daily aspect is to compute the idiosyncratic volatility for each stock at time t. Now that I have the mean and size, I can easily do this. Thank you so much, I appreciate your help. – khalilnait May 05 '20 at 01:41
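
The merge and the per-day grouping discussed in these comments could look roughly like the sketch below. The data is made up, and the column name date is an assumption, since the question only shows ID and residual.

```python
import pandas as pd

# Made-up sample data; "date" is a hypothetical column name.
C = pd.DataFrame({
    "ID": ["a", "a", "a", "a", "b", "b"],
    "date": ["2020-05-01", "2020-05-01", "2020-05-04",
             "2020-05-04", "2020-05-01", "2020-05-01"],
    "residual": [1.0, 3.0, 2.0, 6.0, 10.0, 14.0],
})

# Per-ID statistics computed once, then merged back onto every row of C.
D = C.groupby("ID")["residual"].agg(["mean", "size"]).reset_index()
merged = C.merge(D, on="ID", how="left")

# Per-ID, per-day statistics: add the date column to the groupby keys.
daily = (
    C.groupby(["ID", "date"])["residual"]
     .agg(["mean", "size", "std"])
     .reset_index()
)

print(merged)
print(daily)
```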