I have years of transaction data which I am working with by customer id. The transaction information is at an invoice level, and an id could easily have multiple invoices on the same day or no invoices for years. I am attempting to create dataframes which contain sums of invoices by customer for each year, but which also show years where no invoices were added. Something akin to:

tmp = invoices[invoices['invoice_year'].isin([2018, 2019, 2020])]
tmp = tmp.groupby(['id', pd.Grouper(key='invoice_date', freq='Y')])['sales'].sum()

This would return something akin to:

id    invoice_year    sales
1        2018       483982.20
1        2019       3453
1        2020       453533
2        2018       243
2        2020       23423
3        2020       2330202
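
For anyone who wants to reproduce this, a minimal frame matching the table above (a sketch; values copied verbatim, column names and dtypes assumed):

import pandas as pd

# grouped output as shown above, with the missing (id, year) rows absent
tmp = pd.DataFrame({
    'id':           [1, 1, 1, 2, 2, 3],
    'invoice_year': [2018, 2019, 2020, 2018, 2020, 2020],
    'sales':        [483982.20, 3453.0, 453533.0, 243.0, 23423.0, 2330202.0],
})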

However the desired output would be:

id    invoice_year    sales
1        2018       483982.20
1        2019       3453
1        2020       453533
2        2018       243
2        2019       nan
2        2020       23423
3        2018       nan
3        2019       nan
3        2020       2330202

Ideas?

  • Does this answer your question? [pandas GroupBy columns with NaN (missing) values](https://stackoverflow.com/questions/18429491/pandas-groupby-columns-with-nan-missing-values) – crackaf Dec 11 '20 at 23:07
  • I do not believe so, no. In the above question we are generating `np.nan`s from a lack of data, not replacing NaNs or managing existing NaNs. – WolVes Dec 11 '20 at 23:13
  • You could try the `dropna=False` option if you have the latest pandas. – Stefan Dec 11 '20 at 23:15

2 Answers


Let's suppose the original values are in a dataframe named `df`; then you can try the following:

output = (df.groupby(['id', 'invoice_date'])['val'].sum()
            .unstack(fill_value=0)      # absent (id, date) pairs appear as 0
            .stack()
            .reset_index(name='val'))

Alternatively, you can first create the column `invoice_year`:

df['invoice_year'] = df['invoice_date'].dt.year

Then repeat the same code with `invoice_year` in place of `invoice_date`; this outputs:

   id  invoice_year  val
0   1          2018    1
1   1          2019    1
2   1          2020    0
3   2          2018    1
4   2          2019    0
5   2          2020    1
6   3          2018    0
7   3          2019    1
8   3          2020    1

Using the following data as example:

df = pd.DataFrame({
    'id': [1]*2 + [2]*2 + [3]*2,
    'invoice_date': pd.to_datetime(['2018-12-01', '2019-12-01', '2020-12-01']*2),
    'val': [1]*6,
})
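
If NaN is preferred over 0 for the missing years, as the desired output in the question shows, a minimal variation on the same idea (a sketch, not from the original answer) is to unstack without a fill value and keep the NaNs when stacking back:

output = (df.groupby(['id', 'invoice_year'])['val'].sum()
            .unstack()             # absent (id, year) pairs become NaN
            .stack(dropna=False)   # keep the NaNs instead of dropping them
            .reset_index(name='val'))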
– Celius Stingher

Stefan has posted a comment that may help: simply passing `dropna=False` to your `.groupby` seems like the best bet. But you could also bring the NaNs back afterward, which may be required on earlier versions of pandas that don't have the `dropna=False` parameter. Starting from your current output:

id    invoice_year  sales
1        2018       483982.20
1        2019       3453
1        2020       453533
2        2018       243
2        2020       23423
3        2020       2330202

You can use `pd.MultiIndex.from_product` and reindex the dataframe with a newly created index, `idx`:

i, iy = df['id'], df['invoice_year']

# full cartesian product of every id and every year in the data
idx = pd.MultiIndex.from_product([range(i.min(), i.max() + 1),
                                  range(iy.min(), iy.max() + 1)],
                                 names=[i.name, iy.name])

df = df.set_index([i.name, iy.name]).reindex(idx).reset_index()
df
Out[1]: 
   id  invoice_year      sales
0   1          2018   483982.2
1   1          2019     3453.0
2   1          2020   453533.0
3   2          2018      243.0
4   2          2019        NaN
5   2          2020    23423.0
6   3          2018        NaN
7   3          2019        NaN
8   3          2020  2330202.0
– David Erickson
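
One caveat with the approach above: `range(i.min(), i.max()+1)` assumes customer ids are contiguous integers, so sparse ids would get phantom rows. A variant (hypothetical, not part of the original answer) builds the index from the ids actually observed:

# use the ids actually present rather than a contiguous range
idx = pd.MultiIndex.from_product([sorted(i.unique()),
                                  range(iy.min(), iy.max() + 1)],
                                 names=[i.name, iy.name])
df = df.set_index([i.name, iy.name]).reindex(idx).reset_index()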