-1

I am having trouble reformatting a dataframe.

My input is a day value rows by symbols columns (each symbol has different dates with it's values):

Input

code to generate input

data = [("01-01-2010", 15, 10), ("02-01-2010", 16, 11), ("03-01-2010", 16.5, 10.5)]
labels = ["date", "AAPL", "AMZN"]
df_input = pd.DataFrame.from_records(data, columns=labels)

The needed output is (month row with new row for each month):

Needed output

code to generate output

data = [("01-01-2010","29-01-2010", "AAPL", 15, 20), ("01-01-2010","29-01-2010", "AMZN", 10, 15),("02-02-2010","30-02-2010", "AAPL", 20, 32)]
labels = ['bd start month', 'bd end month','stock', 'start_month_value', "end_month_value"]
df = pd.DataFrame.from_records(data, columns=labels)

Meaning (Pseudo code) 1. for each row take only non nan values to create a new "row" (maybe dictionary with the date as the index and the [stock, value] as the value. 2. take only rows that are business start of month or business end of month. 3. write those rows to a new datatframe.

I have read several posts like this and this and several more. All treat with dataframe of the same "type" and just resampling while I need to change to structure...

My code so far

# creating the new index with business days
df1 =pd.DataFrame(range(10000), index = pd.date_range(df.iloc[0].name, periods=10000, freq='D'))
from pandas.tseries.offsets import CustomBusinessMonthBegin
from pandas.tseries.holiday import USFederalHolidayCalendar
bmth_us = CustomBusinessMonthBegin(calendar=USFederalHolidayCalendar())
df2 = df1.resample(bmth_us).mean()

# creating the new index interseting my old one (daily) with the monthly index
new_index = df.index.intersection(df2.index)

# selecting only the rows I want
df = df.loc[new_index]

# creating a dict that will be my new dataset
new_dict = collections.OrderedDict()
# iterating over the rows and adding to dictionary
for index, row in df.iterrows():
#     print index
    date = df.loc[index].name
    # values are the not none values
    values = df.loc[index][~df.loc[index].isnull().values]

    new_dict[date]=values


# from dict to list
data=[]
for key, values in new_dict.iteritems():
    for i in range(0, len(values)):
        date = key
        stock_name = str(values.index[i])
        stock_value = values.iloc[i]
        row = (key, stock_name, stock_value)
        data.append(row)

# from the list to df
labels = ['date','stock', 'value']
df = pd.DataFrame.from_records(data, columns=labels)
df.to_excel("migdal_format.xls")

Current output I get

One big problem:

  1. I only get value of the stock on the start of month day.. I need start and end so I can calculate the stock gain on this month..

One smaller problem:

  1. I am sure this is not the cleanest and fastest code :)

Thanks a lot!

  • 1
    Can you add [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve) ? – jezrael Jul 13 '18 at 07:39
  • my pictures are not good? – koren maliniak Jul 13 '18 at 07:57
  • No, because is not possible copy data, please check [how to provide a great pandas example](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – jezrael Jul 13 '18 at 07:58
  • I have managed to get both start and end of month dates (I think) with this code.. but still l dont know how to take those two values and put in the same column – koren maliniak Jul 13 '18 at 07:59
  • Pictures as code or data example are **never** good. How do you expect that someone can reproduce your code from a picture? Can *you* run your pictures in a console to get the code/data? ;) Btw.: I'm interested in getting stock-data. Where did you get that data from? – JE_Muc Jul 13 '18 at 07:59
  • Intrinio is my data source – koren maliniak Jul 13 '18 at 08:00
  • added code to generate output. working on input. sucks that people down vote so quickly. – koren maliniak Jul 13 '18 at 08:09
  • added code to generate input – koren maliniak Jul 13 '18 at 08:14
  • Thanks! Is there any advantage of Intrinio over AlphaVantage? Your input code does not span a whole month, so it is impossible to generate start and end values of a month. Besides that it is a `RangeIndex`. You should be working with a `DatetimeIndex` to get the full pandas-functionality. – JE_Muc Jul 13 '18 at 08:26
  • @korenmaliniak - Not sure how you get output `df` from `df_input`, I guess data are different. – jezrael Jul 13 '18 at 08:41
  • It is not the same data.. you want me to make 35 rows of data for the daily input? it's not enough what I did to understand the framework? I can attach my real input data as a file.. but I dont have the output data.. if I had - this would not be a question – koren maliniak Jul 13 '18 at 10:08

1 Answers1

0

So I have found a way.

  1. looping through each column
  2. groupby month
  3. taking the first and last value I have in that month
  4. calculate return

    df_migdal = pd.DataFrame() for col in df_input.columns[0:]: stock_position = df_input.loc[:,col]

    name = stock_position.name
    name = re.sub('[^a-zA-Z]+', '', name)
    name = name[0:-4]
    
    
    stock_position=stock_position.groupby([pd.TimeGrouper('M')]).agg(['first', 'last'])
    
    stock_position["name"] = name
    stock_position["return"] = ((stock_position["last"] / stock_position["first"]) - 1) * 100
    stock_position.dropna(inplace=True)
    df_migdal=df_migdal.append(stock_position)
    df_migdal=df_migdal.round(decimals=2)
    

I tried I way cooler way, but did not know how to handle the ,multi index I got... I needed that for each column, to take the two sub columns and create a third one from some lambda function.

df_input.groupby([pd.TimeGrouper('M')]).agg(['first', 'last'])