
For each of the columns in my df, I want to subtract the previous row from the current row (row[n+1]-row[n]), but I am having difficulty.

My code is as follows:

#!/usr/bin/python3
from pandas_datareader import data
import pandas as pd
import fix_yahoo_finance as yf
yf.pdr_override()
import os


stock_list = ["BHP.AX", "CBA.AX", "RHC.AX", "TLS.AX", "WOW.AX", "^AORD"]

# Function to get the closing price of the individual stocks
# from the stock_list list
def get_closing_price(stock_name, specific_close):
    symbol = stock_name
    start_date = '2010-01-01'
    end_date = '2016-06-01'
    df = data.get_data_yahoo(symbol, start_date, end_date)
    sym = symbol + " "
    print(sym * 10)
    df = df.drop(['Open', 'High', 'Low', 'Adj Close', 'Volume'], axis=1)
    df = df.rename(columns={'Close': specific_close})
    # https://stackoverflow.com/questions/16729483/converting-strings-to-floats-in-a-dataframe
    # df[specific_close] = df[specific_close].astype('float64')
    print(type(df[specific_close]))
    return df

# Creates one big DataFrame with every stock's closing
# price and returns it
def get_all_close_prices(directory):
    count = 0
    for stock_name in stock_list:
        specific_close = stock_name.replace(".AX", "") + "_Close"
        if not count:
            prev_df = get_closing_price(stock_name, specific_close)
        else:
            new_df = get_closing_price(stock_name, specific_close)
            # https://stackoverflow.com/questions/11637384/pandas-join-merge-concat-two-dataframes
            prev_df = prev_df.join(new_df)
        count += 1
    prev_df.to_csv(directory)
    return prev_df

# THIS IS THE FUNCTION I NEED HELP WITH
# AS DESCRIBED IN THE QUESTION
def calculate_return(df):
    count = 0
    # for index, row in df.iterrows():
    print(df.columns[0])
    for stock in stock_list:
        specific_close = stock.replace(".AX", "") + "_Close"
        print(specific_close)
        # https://stackoverflow.com/questions/15891038/change-data-type-of-columns-in-pandas
        pd.to_numeric(specific_close, errors='ignore')
        df.columns[count].diff()
        count += 1
    return df


def main():
    # FINDS THE CURRENT DIRECTORY AND CREATES THE CSV TO DUMP THE DF
    csv_in_current_directory = os.getcwd() + "/stk_output.csv"

    # FUNCTION THAT GETS ALL THE CLOSING PRICES OF THE STOCKS
    # AND RETURNS IT AS ONE COMPLETE DATAFRAME
    df = get_all_close_prices(csv_in_current_directory)

    # THIS PRINTS OUT WHAT IS IN "OUTPUT 1"
    print(df)

    # THIS FUNCTION IS WHERE I HAVE THE PROBLEM
    df = calculate_return(df)

    # THIS SHOULD PRINT OUT WHAT IS IN "EXPECTED OUTPUT"
    print(df)




# Main line of code
if __name__ == "__main__":
    main()

Question:

For each column, I would like to subtract the previous row from the current row (row[n+1]-row[n]) and store the result in a new column at the end of the dataframe, named stock_name + "_Earning". My expected output (see: Expected Output) is the original df as seen in Output 1 plus 6 additional columns, each with an empty first row followed by the row differences (row[n+1]-row[n]) for the corresponding stock.

Problem Faced:

With the current code I get the following error, which I have not been able to resolve:

AttributeError: 'str' object has no attribute 'diff'
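
A minimal sketch of why this error appears, using a tiny stand-in frame (the `demo` frame and its values are made up for illustration): `df.columns[count]` returns the column label, which is a plain `str`, not the column data, so it has no `.diff()` method. Indexing the frame with that label gives a `Series`, which does. The same applies to `pd.to_numeric(specific_close, ...)` in the code above, which receives the label string rather than the column.

```python
import pandas as pd

# Hypothetical stand-in for the real price frame.
demo = pd.DataFrame({"BHP_Close": [40.0, 40.5, 41.5]})

label = demo.columns[0]           # the column *name*, a str
print(type(label))                # <class 'str'>

try:
    label.diff()                  # same mistake as df.columns[count].diff()
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)

column_diff = demo[label].diff()  # index the frame with the label to get a Series
print(column_diff.tolist())       # [nan, 0.5, 1.0]
```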

Things I Have Tried:

Some of the things I have tried:

Expected Output:

            BHP_Close  CBA_Close  RHC_Close  TLS_Close  WOW_Close  ^AORD_Close  BHP_Earning  CBA_Earning  RHC_Earning  TLS_Earning  WOW_Earning  ^AORD_Earning
Date
2010-01-03  40.255699  54.574299  11.240000       3.45  27.847300  4889.799805
2010-01-04  40.442600  55.399799  11.030000       3.44  27.679100  4939.500000     0.186901       0.8255        -0.21        -0.01      -0.1682   49.70020000  

Output 1:

            BHP_Close  CBA_Close  RHC_Close  TLS_Close  WOW_Close  ^AORD_Close
Date
2010-01-03  40.255699  54.574299  11.240000       3.45  27.847300  4889.799805
2010-01-04  40.442600  55.399799  11.030000       3.44  27.679100  4939.500000
2010-01-05  40.947201  55.678299  11.180000       3.38  27.629601  4946.799805
...               ...        ...        ...        ...        ...          ...
2016-05-30  19.240000  78.180000  72.730003       5.67  22.389999  5473.600098
2016-05-31  19.080000  77.430000  72.750000       5.59  22.120001  5447.799805
2016-06-01  18.490000  76.500000  72.150002       5.52  21.799999  5395.200195
3kstc
  • Your question and code are not very clear. What is the output of `print(df)` and what is the output of `calculate_return(df);` – Tony Tannous Feb 27 '18 at 00:28
  • @TonyTannous _What is the output of print(df)_ It is under **Output 1**; _what is the output of calculate_return(df)_ This is where I am having difficulty, namely the `AttributeError: 'str' object has no attribute 'diff'` error which I couldn't resolve :(. – 3kstc Feb 27 '18 at 00:46
  • Your question is overly verbose. I would suggest the following steps. Iterate over the columns, access each column and apply `numpy.diff()` on that column. That column should be a numpy numeric array (not strings). It can be done as: `for col in df.columns: myDiff = np.diff(df[col].as_matrix())`. If your `df[col]` contains numbers formatted as strings, you should convert it to `float` or `int` first, whatever is applicable. – Autonomous Feb 27 '18 at 00:58
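
The `numpy.diff()` suggestion in the last comment can be sketched as follows. Two assumptions to note: `.as_matrix()` has been removed from current pandas releases, so this sketch uses `.to_numpy()` instead, and the column `A` with its values is invented for illustration. `np.diff` returns one element fewer than its input, so the result is padded with a leading NaN to line up with the original rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 4.0, 9.0]})  # hypothetical data

values = df["A"].to_numpy()   # stands in for the removed .as_matrix()
steps = np.diff(values)       # one element shorter than the column

# Pad with NaN so the differences line up with the original rows.
df["A_diff"] = np.concatenate(([np.nan], steps))
print(df)
```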

1 Answer


Here is an easy and quick way to do what you want:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(25).reshape(5, 5), 
                  columns=['A', 'B', 'C', 'D', 'E'])
print(df)

result:

    A   B   C   D   E
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
4  20  21  22  23  24

We can use the `shift` method to move the entire DataFrame down (or up) by one row. Subtracting the shifted frame from the original gives the row-to-row differences, and we then rename the columns.

df2 = df - df.shift(1, axis=0) 
df2.columns = [col + '_earning' for col in df2.columns]
print(df2)

result:

   A_earning  B_earning  C_earning  D_earning  E_earning
0        NaN        NaN        NaN        NaN        NaN
1        5.0        5.0        5.0        5.0        5.0
2        5.0        5.0        5.0        5.0        5.0
3        5.0        5.0        5.0        5.0        5.0
4        5.0        5.0        5.0        5.0        5.0
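
As a side note, pandas also offers `DataFrame.diff()`, which should give the same row-to-row differences as the shift-and-subtract step without the explicit shift. A small sketch on the same toy frame, comparing the two approaches:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(25).reshape(5, 5),
                  columns=['A', 'B', 'C', 'D', 'E'])

shifted = df - df.shift(1, axis=0)  # approach used in this answer
diffed = df.diff()                  # built-in row-to-row difference

# Compare the two results, NaN first row included.
print(diffed.equals(shifted))
```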

Then just join the result with the original.

result = pd.concat([df, df2], axis=1)
print(result)

result:

    A   B   C   D   E  A_earning  B_earning  C_earning  D_earning  E_earning
0   0   1   2   3   4        NaN        NaN        NaN        NaN        NaN
1   5   6   7   8   9        5.0        5.0        5.0        5.0        5.0
2  10  11  12  13  14        5.0        5.0        5.0        5.0        5.0
3  15  16  17  18  19        5.0        5.0        5.0        5.0        5.0
4  20  21  22  23  24        5.0        5.0        5.0        5.0        5.0
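
Applied to the naming scheme in the question, the same idea might look like the sketch below. The helper name `add_earning_columns` is hypothetical, and the sample frame uses only two columns and three rows copied from "Output 1" to keep it short; it assumes every column ends in `_Close` and already holds numeric data.

```python
import pandas as pd

def add_earning_columns(df):
    """Return df plus one '<name>_Earning' column per '<name>_Close' column."""
    earnings = df.diff()  # current row minus previous row; first row is NaN
    earnings.columns = [col.replace("_Close", "_Earning") for col in df.columns]
    return df.join(earnings)

# Sample values taken from "Output 1" in the question.
prices = pd.DataFrame(
    {"BHP_Close": [40.255699, 40.442600, 40.947201],
     "TLS_Close": [3.45, 3.44, 3.38]},
    index=pd.to_datetime(["2010-01-03", "2010-01-04", "2010-01-05"]),
)
prices.index.name = "Date"

result = add_earning_columns(prices)
print(result)
```

Because the original frame is joined with the renamed differences, the `_Close` columns are left untouched and the new columns are appended at the end.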

edit: after revisiting your post, it looks like you're trying to do this operation on some columns containing strings? Either filter them out or convert them to a datatype which supports the '-' operator.
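
If some columns really do hold numbers stored as strings, one possible conversion is sketched below. The `raw` frame and its values are invented for illustration; `errors="coerce"` is a choice made here so that unparseable entries become NaN rather than raising.

```python
import pandas as pd

# Hypothetical frame whose prices were read in as strings.
raw = pd.DataFrame({"A_Close": ["1.5", "2.0", "n/a"],
                    "B_Close": ["10", "12", "15"]})

# Convert every column; values that cannot be parsed become NaN.
numeric = raw.apply(pd.to_numeric, errors="coerce")

print(numeric.dtypes)
print(numeric.diff())
```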

evamicur