
Background:

I am new to Python and programming in general. I want to know whether my implementation approach is sound, or whether there is a better way to do it.

Data:

I have a CSV with around 400 stocks, e.g.

Date, SH Ltd, Date, QS Ltd, Date, WX Ltd ...
26/02/18, 34, 16/06/13, 5634, 15/06/17, 100
27/02/18, 33, 17/06/13, 5763, 16/06/17, 100
28/02/18 ,35, 18/06/13, 6139, 17/06/17, 100
...

So every first column has some random start date, but ALL series end as of yesterday. Now let's say I want to do two things:

1) Calculate the vol over the last 252 days.
2) Calculate the worst three 2-day gaps of each stock.

My Approach:

I am currently thinking I loop through each column of the CSV and build a time-series vector for each stock. Then I run a method on it to calculate the 2-day gaps from start date + 2 up to today, collect them in a vector, sort that vector from big to small, and take the 3 smallest (the worst gaps). Then I take the last 252 days, work out the standard deviation and multiply by the square root of 252.

After this I have 2 outputs:

1) A vector of the worst 2-day performances (the 3 worst).
2) A number for the vol over the last 252 days.

At this point I write this data to one CSV for the 2-day performances and a separate volatility CSV. Then I just continue to loop through every other column, as long as it has data, and append to the CSV files above.
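For concreteness, here is a rough, untested sketch of what I have in mind for a single stock (a plain Python list of prices; I am assuming the vol should be the standard deviation of daily returns, which is part of what I would like checked):

import math
import statistics

def analyse(prices):
    # 2-day gaps: percentage change from day i to day i + 2
    gaps = [(prices[i + 2] - prices[i]) / prices[i] for i in range(len(prices) - 2)]
    worst3 = sorted(gaps)[:3]                       # the three most negative gaps
    last = prices[-252:]                            # the last 252 closes
    rets = [(b - a) / a for a, b in zip(last, last[1:])]
    vol = statistics.pstdev(rets) * math.sqrt(252)  # annualised vol
    return worst3, vol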

My Question:

Is this a very inefficient method? Does the continual use of multiple vectors slow my program down heavily, compared to using a single vector for just one calculation?

Amadan

1 Answer


Let's start with a cleaner statement of the question and a short answer, then discuss general optimization issues.

I have created a file with some more lines, and also removed spaces from the column names (to make things simpler).

Date,SH,Date,QS,Date,WX
26/02/18, 34, 16/06/13, 5634, 15/06/17, 100
27/02/18, 33, 17/06/13, 5763, 16/06/17, 100
1/03/18 ,35, 18/06/13, 6139, 17/06/17, 100
2/03/18 ,35, 18/06/13, 6139, 17/06/17, 100
3/03/18 ,35, 18/06/13, 6139, 17/06/17, 100
4/03/18 ,39, 18/06/13, 6139, 17/06/17, 100
5/03/18 ,35, 18/06/13, 6139, 17/06/17, 100
6/03/18 ,35, 18/06/13, 6139, 17/06/17, 100
7/03/18 ,35, 18/06/13, 6114, 17/06/17, 100
8/03/18 ,35, 18/06/13, 6139, 17/06/17, 100
9/03/18 ,35, 18/06/13, 6139, 17/06/17, 100
19/03/18 ,35, 18/06/13, 6139, 17/06/17, 100

import pandas as pd

fn = "testlogs"  # path to the sample CSV above
WINDOW = 3       # rolling window length, in rows

df = pd.read_csv(fn)

# rolling population standard deviation (ddof=0) of the SH column
print(df.SH.rolling(WINDOW).std(ddof=0))

0          NaN
1          NaN
2     0.816497
3     0.942809
4     0.000000
5     1.885618
6     1.885618
7     1.885618
8     0.000000
9     0.000000
10    0.000000
11    0.000000
Name: SH, dtype: float64
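The rolling standard deviation above just demonstrates the mechanics. The two quantities the question actually asks for could be computed along the following lines (a sketch, not a definitive recipe: I am assuming a "2-day gap" means the percentage change over two rows, and that the vol is computed on daily returns):

import numpy as np

returns = df.SH.pct_change()                            # simple daily returns
vol_252 = returns.tail(252).std(ddof=0) * np.sqrt(252)  # annualised vol over the last 252 rows

gaps = df.SH.pct_change(periods=2)                      # change over a 2-day window
worst3 = gaps.nsmallest(3)                              # the three worst (most negative) gaps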

For more about VOL see here

Now, regarding efficiency: "premature optimization is the root of all evil" (Knuth). In your case, with only 400 stocks and only one line per day, a full century of data is on the order of ten million values, i.e. well under 100 MB, and even a cheap smartphone has far more RAM than that. So all your data will sit in RAM, and even if you read each value many times, the calculation of VOL should take a fraction of a second with very low CPU usage.
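As a back-of-the-envelope check (assuming 8-byte floats and roughly 252 trading days per year):

n_values = 400 * 252 * 100       # stocks * trading days per year * years
print(n_values * 8 / 1e6, "MB")  # about 80 MB of raw float64 data

Either way, the whole dataset fits comfortably in memory on any modern machine.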

More about my specific implementation: I use pandas, which has the following advantages: (1) short and simple code, and (2) efficiency due to vectorized calculations.

Disadvantage: you need to learn the library, which can be overwhelming for beginners.
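To cover all ~400 stocks, the same calls can simply be run once per price column, for example (a sketch that assumes the interleaved Date/price column layout of the sample file, with each series possibly starting on a different date):

price_cols = df.columns[1::2]    # every second column; skips the interleaved Date columns
for col in price_cols:
    s = df[col].dropna()         # each stock starts on a different date
    print(col, s.pct_change(periods=2).nsmallest(3).tolist())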

Ofer Rahat