
I've been working with a list of DataFrames for my analysis, and I would like to find a way to recreate this list faster. Tips on good practices are also welcome.

This code is simply taking too long when I use more stocks. I'd like to improve this part:

stocks_list_DataFrames = []
stocks_all_symbol_list = list(stocks_all_csv['Symbol'].unique())
for symbol in stocks_all_symbol_list:
    # Filter the full DataFrame down to the rows for this one symbol
    stock_data = stocks_all_csv[stocks_all_csv['Symbol'] == symbol]
    stocks_list_DataFrames.append(stock_data)

And for reproducibility, copy the following:

import pandas as pd
from datetime import date
import yfinance as yf

stocks_all = []
start = date(2017, 10, 1)
end = date(2020, 6, 25)
list_symbols = ["CERS", "CERU", "CETV", "CEVA", "CFA", "CFBK", "CFFI", "CFFN",
               "CFGE", "CFNB", "CFNL", "CFO", "CFRX", "CFRXW", "CFRXZ", "CG", 
               "CGEN", "CGIX", "CGNX", "CGO", "CHCI", "CHCO", "CHDN", "CHEF",
               "CHEV"]

for symbol in list_symbols:
    print(symbol)
    # Download this symbol's price history and tag every row with the symbol
    stock_data = yf.download(symbol, start, end)
    stock_data.insert(0, 'Symbol', symbol)
    stocks_all.append(stock_data)

# pd.concat(stocks_all).to_csv('stocks_all.csv')
# stocks_all_csv = pd.read_csv('.../stocks_all.csv')

stocks_all_csv = pd.concat(stocks_all)

Any help would be greatly appreciated.

Artur Dutra

1 Answer


You are filtering the whole df on every stock name. Try .groupby() instead. It indexes the df once based on the selected features and returns a groupby object with the list of unique features (or combinations of features) and the indexes of the matching rows.

Loop as follows:

for symbol, stock_data in stocks_all_csv.groupby('Symbol'):

Now symbol is a string (as in the df) and stock_data is a filtered df, just as in your code.
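
Putting it together, a minimal sketch of the rebuilt loop (same variable names as in the question):

stocks_list_DataFrames = []
# Single pass: groupby indexes 'Symbol' once instead of rescanning
# the whole df for every symbol.
for symbol, stock_data in stocks_all_csv.groupby('Symbol'):
    stocks_list_DataFrames.append(stock_data)

Or, as a one-liner: stocks_list_DataFrames = [df for _, df in stocks_all_csv.groupby('Symbol')].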

The main argument to groupby can be a column, level, mapping, function, indexer, or a list containing any of the aforementioned types.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html

RichieV
  • It seems like a good alternative, but it still lacks the speed I'm looking for... Thanks – Artur Dutra Jul 27 '20 at 22:57
  • Can you share the time comparison? – RichieV Jul 27 '20 at 23:06
  • I started running it when you posted your comment... it hasn't stopped yet. The last run took almost this long as well – Artur Dutra Jul 27 '20 at 23:11
  • Just took another look at your data... Why are you appending and then separating the data? Do you perform a joint analysis on all of it? Can you store separate dfs? – RichieV Jul 27 '20 at 23:14
  • I guess this would be another solution: use a loop calling to_csv() on each dataframe and then pd.read_csv() to load them. I will try that (see the sketch after this thread) – Artur Dutra Jul 27 '20 at 23:18
  • I was expecting a solution with iterrows() or lambda, but I couldn't manage to understand how either works. Thanks anyway – Artur Dutra Jul 27 '20 at 23:20
  • Iteration is the least efficient of the methods in pandas... see this question https://stackoverflow.com/q/16476924/6692898 and the warning in the "getting started" section of the documentation https://pandas.pydata.org/docs/getting_started/basics.html?highlight=iteration#iteration – RichieV Jul 27 '20 at 23:26
  • This is also good reading https://stackoverflow.com/q/8097408/6692898 – RichieV Jul 27 '20 at 23:31
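
Following up on the per-file idea from the comments, a rough sketch of saving and reloading per-symbol CSVs (the stocks/ directory and the per-symbol file names are assumptions, not from the thread):

import os
import pandas as pd

os.makedirs('stocks', exist_ok=True)  # assumed output directory

# Write each symbol's rows to its own CSV once...
for symbol, stock_data in stocks_all_csv.groupby('Symbol'):
    stock_data.to_csv(f'stocks/{symbol}.csv')

# ...then on later runs, load only the files you need.
stocks_list_DataFrames = [pd.read_csv(f'stocks/{symbol}.csv')
                          for symbol in list_symbols]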