Simple example dataframe:
    import pandas as pd

    df = pd.DataFrame({
        'year': [1900, 1901, 1901, 1901, 1902, 1903, 1903, 1903, 1905]
    })
I have the below function that takes in a pandas dataframe:
    def my_function(df):
        df = df.groupby(['year'])  # group the df by year
        new_df = pd.DataFrame()  # make a new empty df
        new_df['frequency'] = df['year'].count()  # get frequency counts for each year
        return new_df
However, the output for this doesn't give me a 0 frequency count for the missing years.
Ideal output of my_function(df):
year frequency
1900 1
1901 3
1902 1
1903 3
1904 0
1905 1
Current output of my_function(df):
1900 1
1901 3
1902 1
1903 3
1905 1
I think I'm close with DataFrame.reindex() but need some direction.
I've read the docs for reindex() and looked at this Stack Overflow post as well as this one, but I still haven't been able to solve it.
I've defined the range of years I want in a new variable:
    new_idx = range(1900, 1906)  # range() excludes its end value, so 1906 is needed to cover 1905
And then tried calling reindex() like so:
    new_df.reindex(new_idx, fill_value=0)
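As a sanity check of what reindex() with fill_value is expected to do on its own, here is a minimal sketch on a hand-built Series. The names `counts`, `full_years` and `filled`, and the values themselves, are made up for illustration and are not taken from the real data. It assumes the index holds integer years.

```python
import pandas as pd

# Hypothetical hand-built counts, indexed by integer year (illustration only).
counts = pd.Series(
    [1, 3, 1, 3, 1],
    index=[1900, 1901, 1902, 1903, 1905],
    name="frequency",
)

# range() stops before its end value, so 1906 is needed to include 1905.
full_years = range(1900, 1906)

# Labels already in the index keep their values;
# labels missing from the index receive fill_value.
filled = counts.reindex(full_years, fill_value=0)
print(filled)
```

If this behaves as expected while the real function does not, the difference is more likely in the index being reindexed than in reindex() itself.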
With reindex() added, the function now looks like this:
    def my_function(df):
        new_idx = range(1900, 1906)  # range() excludes its end value, so 1906 covers 1905
        df = df.groupby(['year'])
        new_df = pd.DataFrame()
        new_df['frequency'] = df['year'].count()
        new_df = new_df.reindex(new_idx, fill_value=0)
        return new_df
However, this returns a new DataFrame that has the length I'd like (one row per year in new_idx), but it sets every frequency value to 0 instead of just those of the "added" years.
Ideal output of slightly tweaked my_function(df):
year frequency
1900 1
1901 3
1902 1
1903 3
1904 0
1905 1
Current output of slightly tweaked my_function(df):
year frequency
1900 0
1901 0
1902 0
1903 0
1904 0
1905 0
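
One scenario that would produce this all-zero output is worth ruling out. It is an assumption, since the example dataframe above stores years as integers: if the year column in the real data holds strings such as '1900', none of the integer labels in new_idx match the index, so reindex() fills every row with 0. A minimal sketch of that scenario, using made-up data and a hypothetical helper named `count_by_year`:

```python
import pandas as pd


def count_by_year(df, years):
    """Count rows per year and give absent years a count of 0 (hypothetical helper)."""
    counts = df.groupby("year").size().rename("frequency")
    return counts.reindex(years, fill_value=0).to_frame()


years = range(1900, 1906)  # end is exclusive, so this covers 1900 through 1905

# Assumed scenario: years stored as strings, so no integer label matches the index.
df_str = pd.DataFrame({"year": ["1900", "1901", "1901", "1905"]})
print(count_by_year(df_str, years))  # every frequency is 0

# Casting the column to int first lets the labels line up with the range.
df_int = df_str.astype({"year": int})
print(count_by_year(df_int, years))
```

Checking `df['year'].dtype` on the real data would confirm or rule out this explanation.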