
I am trying to keep only the latest row for each year in a dataframe (2000 through 2018), and then convert the date from dd-mm-yyyy to just the year number.

So far I have only imported the data:

import pandas as pd
import pandas_datareader.data as web  # assumed source of the `web` alias

df_spx = web.DataReader('^GSPC', 'yahoo', start='2000', end='2018')
df_spx.reset_index(inplace=True)
df_spx['Date'] = pd.to_datetime(df_spx['Date'])
df_spx

The output was posted as a linked image, since I can't embed pictures yet.

baldr
brankoo
  • Please make a [good example](https://stackoverflow.com/a/20159305/3620003). I'm not typing that image into my interpreter. Include the desired output. Five rows max. As executable code or copy-pasteable. – timgeb Jan 24 '22 at 15:15
  • @timgeb thank you for the feedback. I already got an answer below that worked, and I will improve my questions in the future. – brankoo Jan 24 '22 at 15:43
  • awesome, thank you! – timgeb Jan 24 '22 at 15:45

1 Answer


First, sort the dataframe by the 'Date' column.

Then replace each date with its year, so the year can be used as the grouping key.

Finally, group by year and keep the last row of each group:

import pandas as pd

# sample dataframe
df = pd.DataFrame({
    'Date': ['2000-01-03', '2000-01-04', '2018-12-19', '2018-12-18'],
    'High': [1, 2, 3, 4],
})

df['Date'] = pd.to_datetime(df['Date'])
df.sort_values(by=['Date'], inplace=True)  # sort chronologically
df['Date'] = df['Date'].dt.year  # rows are already sorted, so keep only the year

df = df.groupby('Date').tail(1)  # last (latest) row of each year

print(df)
#    Date  High
# 1  2000     2  <- latest row for 2000 (2000-01-04)
# 2  2018     3  <- latest row for 2018 (2018-12-19)
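
If you would rather keep the full date next to the year, a variant of the same idea might look like the sketch below. It reuses the sample data from above; the `Year` column and the `latest` name are only illustrative choices, not something from the question.

```python
import pandas as pd

# Same sample data as above (a stand-in for the downloaded prices).
df = pd.DataFrame({
    'Date': ['2000-01-03', '2000-01-04', '2018-12-19', '2018-12-18'],
    'High': [1, 2, 3, 4],
})
df['Date'] = pd.to_datetime(df['Date'])

latest = (
    df.sort_values('Date')                          # chronological order
      .assign(Year=lambda d: d['Date'].dt.year)     # hypothetical helper column
      .drop_duplicates('Year', keep='last')         # keep the last row of each year
)

print(latest)
```

Here the original `Date` values survive, so you can still check which trading day was picked for each year before dropping the column.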
Park