Create pandas data frame with date index by concatenating multiple (date, price) data frames with some missing values

Question

I have the following date & price data for some grocery items, with the important note that some of these items have missing records for some months. For example, bread has data for all months of 2022, but juice, among others, has missing values. For example:

bread_dataframe

Date	Price
1/1/22	4.2
2/1/22	4.0
3/1/22	4.1
4/1/22	3.1
5/1/22	2.1
...	...

juice_dataframe

Date	Price
1/1/22	3.2
4/1/22	4.1
5/1/22	5.1
...	...

and so on..

My goal is to create a single dataframe with a date index and columns representing each item, with NaN showing for any month/price item we have no price for. Like so:

Date	Bread	Juice
1/1/22	4.2	3.2
2/1/22	4.0	NaN
3/1/22	4.1	NaN
4/1/22	3.2	4.1
5/1/22	2.1	5.1

So far, I have tried creating a dictionary of items_dict[item_name] = pd.DataFrame({'Date': dates, 'Price': item_price}) and then using pd.concat(items_dict, axis=1) but this only returns this sort of dataframe.

Date	('Bread', 'Price')	('Juice', 'Price')	...
1/1/22
2/1/22
3/1/22
4/1/22
5/1/22

Any help is much appreciated.

Shubham Garg · Answer 1 · 2023-01-30T15:40:33.397

In both the bread dataframe and the juice dataframe, convert the date column to a datetime object. Then, in both dataframes, make the date column the index. To distinguish between duplicate column names, use the pd.merge function to merge the two dataframes on the date index, and set the suffixes parameter. Finally, pivot the merged dataframe using the pd.pivot method, with the date as the index, item names as columns, and price as values.

Here's the code:

import pandas as pd

bread_dataframe['Date'] = pd.to_datetime(bread_dataframe['Date'], format='%m/%d/%y')
juice_dataframe['Date'] = pd.to_datetime(juice_dataframe['Date'], format='%m/%d/%y')

bread_dataframe.set_index('Date', inplace=True)
juice_dataframe.set_index('Date', inplace=True)

merged_df = pd.merge(bread_dataframe, juice_dataframe, left_index=True, right_index=True, suffixes=('_bread', '_juice'))
pivoted_df = merged_df.pivot(columns='item', values='Price')

This should produce the desired single dataframe with a date index and columns representing each item, with NaN displaying for any month/price item for which we do not have a price.

EDIT: For multiple dataframes, you can just use a for loop to iterate over the list of dataframes.

import pandas as pd
dataframes = [bread_dataframe, juice_dataframe, ...]
result = None
for i, df in enumerate(dataframes):
    df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%y')
    df.set_index('Date', inplace=True)
    if i == 0:
        result = df
    else:
        result = pd.merge(result, df, left_index=True, right_index=True, suffixes=('', f'_{i}'))            
pivoted_df = result.pivot(columns='item', values='Price')

Thank you for the input, but I am dealing with multiple dataframes and not just the two I mentioned above. Is there a an approach to this that would work with 20+ items, for example? — jonisate85, Jan 30 '23 at 15:10
@jonisate85 Yes, you can use a for loop for this. I have updated the answer with the code. — Shubham Garg, Jan 30 '23 at 15:41

Create pandas data frame with date index by concatenating multiple (date, price) data frames with some missing values

1 Answers1