
Simple example DataFrame:

import pandas as pd

df = pd.DataFrame({
    'year': [1900, 1901, 1901, 1901, 1902, 1903, 1903, 1903, 1905]
})

I have the below function that takes in a pandas dataframe:

def my_function(df):
    df = df.groupby(['year']) # group the df by year
    new_df = pd.DataFrame() # make a new empty df
    new_df['frequency'] = df['year'].count() # get frequency counts for each year
    return new_df

However, the output for this doesn't give me a 0 frequency count for the missing years.

Ideal output of my_function(df):

year frequency
1900 1
1901 3
1902 1
1903 3
1904 0
1905 1

Current output of my_function(df):

year frequency
1900 1
1901 3
1902 1
1903 3
1905 1

I think I'm close with DataFrame.reindex() but need some direction.

I've scanned the docs for DataFrame.reindex() and looked at this Stack Overflow post as well as this one, but I still haven't been able to solve it.

I've defined a range of ideal years in a new variable

new_idx = range(1900, 1906)  # the end of range() is exclusive, so this covers 1900 through 1905

And then tried applying reindex() like so: new_df.reindex(new_idx, fill_value=0)

This resulted in a slightly different function that now looks like this:

def my_function(df):
    new_idx = range(1900, 1906)  # 1900 through 1905 inclusive
    df = df.groupby(['year'])
    new_df = pd.DataFrame()
    new_df['frequency'] = df['year'].count()
    new_df = new_df.reindex(new_idx, fill_value=0)
    return new_df

However, this returns a new DataFrame that has the size I want (one row per year in new_idx), but every frequency value is overwritten with 0 instead of just the "added" years.

Ideal output of slightly tweaked my_function(df):

year frequency
1900 1
1901 3
1902 1
1903 3
1904 0
1905 1

Current output of slightly tweaked my_function(df):

year frequency
1900 0
1901 0
1902 0
1903 0
1904 0
1905 0

2 Answers


You want Series.value_counts + Series.reindex:

new_df = (
    df['year'].value_counts()
    # add every year between the first and last observed year, with a count of 0 for new ones
    .reindex(range(df['year'].min(), df['year'].max() + 1), fill_value=0)
    .rename_axis('year')
    .reset_index(name='frequency')
)
print(new_df)

   year  frequency
0  1900          1
1  1901          3
2  1902          1
3  1903          3
4  1904          0
5  1905          1
ansev
  • This looks really close, and works on the small dataset given. However, I would ideally be able to apply this to dataframes of varying length as the years might be missing at the beginning or the end of the range. So instead of setting the range to be the min() and max() of the df argument, I'd like it to be say, of range(1986, 2019). However, this seems to result in the same error in the original post (i.e. all frequencies are replaced with 0s). Any ideas? – trombonebraveheart Oct 22 '19 at 06:17
  • It could be due to the dtype of the column. Are you using str, float or int? – ansev Oct 22 '19 at 13:24
  • I am using type int. For me if I use the test df posted then I get all zeros in the 'year' Series when trying to expand the index by a larger range. – trombonebraveheart Oct 22 '19 at 23:50
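A minimal sketch of the fixed-range case raised in the comments above, assuming the all-zeros result comes from a mismatch between the type of the year labels and the integer labels passed to reindex (the thread does not confirm this for the commenter's data). The function name year_frequencies and its start/end parameters are hypothetical, and the coercion assumes the column has no missing values:

```python
import pandas as pd

def year_frequencies(df, start, end):
    """Count rows per year for every year from start to end, inclusive."""
    # Coerce to integers so the counted labels have the same type as the
    # integer labels passed to reindex (assumes no missing values).
    years = pd.to_numeric(df['year']).astype(int)
    return (
        years.value_counts()
        .reindex(range(start, end + 1), fill_value=0)
        .rename_axis('year')
        .reset_index(name='frequency')
    )

df = pd.DataFrame({
    'year': [1900, 1901, 1901, 1901, 1902, 1903, 1903, 1903, 1905]
})

# Range wider than the data: years before 1900 and after 1905 get 0
print(year_frequencies(df, 1898, 1907))

# Same data stored as strings; the coercion keeps the counts intact
df_str = df.astype(str)
print(year_frequencies(df_str, 1898, 1907))
```

If df['year'].dtype reports object rather than an integer type, the labels produced by value_counts are strings such as '1901', which never match the integers in the range, so every row receives the fill value.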

You could use pd.date_range + pd.Series.value_counts:

import pandas as pd

df = pd.DataFrame({
    'year': [1900, 1901, 1901, 1901, 1902, 1903, 1903, 1903, 1905]
})

# generate a date range between the minimum and maximum year, with year-start
# frequency ('YS'; older pandas versions also accept the alias 'AS')
years = pd.date_range(
    start=pd.Timestamp(year=int(df['year'].min()), month=1, day=1),
    end=pd.Timestamp(year=int(df['year'].max()), month=1, day=1),
    freq='YS',
)

# map each year to its count and set missing years to zero
result = pd.Series(years.year, index=years.year).map(df['year'].value_counts()).fillna(0)

print(result)

Output

1900    1.0
1901    3.0
1902    1.0
1903    3.0
1904    0.0
1905    1.0
dtype: float64

The idea is to generate a yearly date range between the first and the last year and map the frequencies of the years onto this range.
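The counts above come back as floats because the unmatched year is NaN before fillna. As a variation on the same mapping idea (a sketch, not part of the original answer), the years can be kept as plain integers and the result cast back to int; the name all_years is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    'year': [1900, 1901, 1901, 1901, 1902, 1903, 1903, 1903, 1905]
})

# Every integer year from the first to the last observed year, inclusive
all_years = range(int(df['year'].min()), int(df['year'].max()) + 1)

result = (
    pd.Series(all_years, index=all_years)
    .map(df['year'].value_counts())  # years with no rows become NaN
    .fillna(0)                       # replace NaN with a zero count
    .astype(int)                     # restore integer counts
    .rename_axis('year')
)

print(result)
```

This skips the datetime round trip, which is only needed if the index should hold real dates.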

Dani Mesejo