Replicate rows to prepare Pandas DataFrame for date-based merge

Question

I have a pandas df with a few million rows of Item/Groups called items:

Item  Group
0001  A
0002  A
0003  B
0004  A
...

And I need to prepare items to be merged with a df called forecast which looks like this:

Item  Group  Month Forecast
0001  A      5     15
0001  A      6     16
0001  A      7     13
0002  A      5     60
0002  A      7     65

My merge will look something like: items.merge(forecast, on=['Item', 'Group', 'Month']

So my problem is preparing items to look something like this:

Item  Group  Month
0001  A      5
0001  A      6
0001  A      7
0002  A      5
0002  A      6
0002  A      7
...

So that for a given set of months (e.g 5, 6, 7), there are len(months) rows for each Item/Group combination, regardless of whether or not there's forecast data for that Item/Group/Month in forecast.

The order needs to be preserved (e.g. the three months for a given Item/Row need to be grouped together, rather than copies just appended to the end, and the items need to stay in order).

So far I've been looking into stack() as a possible option, but I haven't gotten anything to work (save for looping through the df, appending row * 2, to a new empty df as I go, which seems massively inefficient).

What's the right way to do this?

What is the new information that the merge will contain? NaN's for Item/Group/Month which have no forecast? — Charlie G, Apr 25 '17 at 20:45
@Charlie Yes. But I've simplified the dataframes. `items` and `forecast` also have other columns, so really the goal is to bring the 'forecast' info from `forecast` into `items` and create NaNs for all missing forecasts. Also, `items` is a superset of `forecasts`, so doing the op directly on `forecast` wouldn't create it for all items. — Sean Kramer, Apr 25 '17 at 20:49

piRSquared · Answer 1 · 2017-04-25T22:06:12.330

2

months = [5, 6, 7]

idx = items.index.repeat(len(months))
months_ = months * len(items)
items_ = items.loc[idx].assign(Month=months_).reset_index(drop=True)

print(items_)

    Item Group  Month
0      1     A      5
1      1     A      6
2      1     A      7
3      2     A      5
4      2     A      6
5      2     A      7
6      3     B      5
7      3     B      6
8      3     B      7
9      4     A      5
10     4     A      6
11     4     A      7

forecast.merge(items_)

   Item Group  Month  Forecast
0     1     A      5        15
1     1     A      6        16
2     1     A      7        13
3     2     A      5        60
4     2     A      7        65

edited Apr 25 '17 at 22:06

answered Apr 25 '17 at 22:01

piRSquared

285,575
57
475
624

This is the finessed pandas-ee answer I was imagining! I think I'm going to stick with the other one though, despite it likely being less efficient, b/c it's more explicit about what's going on. – Sean Kramer Apr 25 '17 at 22:07
@SeanKramer That's your choice (of course)... Hopefully this still proves useful :-) – piRSquared Apr 25 '17 at 22:09
Certainly useful! The repeat and assign functions are both things I haven't come across before that I probably could have used numerous times already. – Sean Kramer Apr 25 '17 at 22:12

score 1 · Answer 2 · edited May 23 '17 at 12:02

1

Try an outer merge

pd.merge(items, forecast, on=['Item', 'Group'], how='outer')

From here

If not, try this page for more inspiration: http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging

edited May 23 '17 at 12:02

Community

1
1

answered Apr 25 '17 at 20:52

Charlie G

534
5
16

This doesn't address creating a sensible "Month" column in `items` that includes **all** possible Item/Group/Month combinations. That's the question here, not how to merge once that's done. – Sean Kramer Apr 25 '17 at 20:53
Months should be added in the merge if I'm doing it correctly...I think. I updated my answer to remove the `'Month'` key. Go to the outer merge in the link and see if that will do what you want. – Charlie G Apr 25 '17 at 21:00

score 1 · Accepted Answer · answered Apr 25 '17 at 21:46

Starting from the following dataframe, with columns Item and Group not set as Index, I have the following:

   Item Group
0  0001     A
1  0002     A
2  0003     B
3  0004     A

items_mod = pd.DataFrame()
for i in [5, 6, 7]:
    items['Month'] = i
    items_mod = items_mod.append(items)
items_mod = items_mod.sort_values('Item')

Which gives me the following dataframe:

   Item Group  Month
0  0001     A      5
0  0001     A      6
0  0001     A      7
1  0002     A      5
1  0002     A      6
1  0002     A      7
2  0003     B      5
2  0003     B      6
2  0003     B      7
3  0004     A      5
3  0004     A      6
3  0004     A      7

This is basically what I ended up with. The only issue is that it doesn't preserve the original ordering of the items (they're not actually in sorted order in the real DF). To address that, I just kept the index as a column and sorted on that afterward. Not sure why I thought it was more complicated than this. — Sean Kramer, Apr 25 '17 at 21:54

Replicate rows to prepare Pandas DataFrame for date-based merge

3 Answers3