2

I have a pandas df with a few million rows of Item/Groups called items:

Item  Group
0001  A
0002  A
0003  B
0004  A
...

And I need to prepare items to be merged with a df called forecast which looks like this:

Item  Group  Month Forecast
0001  A      5     15
0001  A      6     16
0001  A      7     13
0002  A      5     60
0002  A      7     65

My merge will look something like: items.merge(forecast, on=['Item', 'Group', 'Month']

So my problem is preparing items to look something like this:

Item  Group  Month
0001  A      5
0001  A      6
0001  A      7
0002  A      5
0002  A      6
0002  A      7
...

So that for a given set of months (e.g 5, 6, 7), there are len(months) rows for each Item/Group combination, regardless of whether or not there's forecast data for that Item/Group/Month in forecast.

The order needs to be preserved (e.g. the three months for a given Item/Row need to be grouped together, rather than copies just appended to the end, and the items need to stay in order).

So far I've been looking into stack() as a possible option, but I haven't gotten anything to work (save for looping through the df, appending row * 2, to a new empty df as I go, which seems massively inefficient).

What's the right way to do this?

Sean Kramer
  • 133
  • 7
  • 1
    What is the new information that the merge will contain? NaN's for Item/Group/Month which have no forecast? – Charlie G Apr 25 '17 at 20:45
  • @Charlie Yes. But I've simplified the dataframes. `items` and `forecast` also have other columns, so really the goal is to bring the 'forecast' info from `forecast` into `items` and create NaNs for all missing forecasts. Also, `items` is a superset of `forecasts`, so doing the op directly on `forecast` wouldn't create it for all items. – Sean Kramer Apr 25 '17 at 20:49

3 Answers3

2
months = [5, 6, 7]

idx = items.index.repeat(len(months))
months_ = months * len(items)
items_ = items.loc[idx].assign(Month=months_).reset_index(drop=True)

print(items_)

    Item Group  Month
0      1     A      5
1      1     A      6
2      1     A      7
3      2     A      5
4      2     A      6
5      2     A      7
6      3     B      5
7      3     B      6
8      3     B      7
9      4     A      5
10     4     A      6
11     4     A      7

forecast.merge(items_)

   Item Group  Month  Forecast
0     1     A      5        15
1     1     A      6        16
2     1     A      7        13
3     2     A      5        60
4     2     A      7        65
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • This is the finessed pandas-ee answer I was imagining! I think I'm going to stick with the other one though, despite it likely being less efficient, b/c it's more explicit about what's going on. – Sean Kramer Apr 25 '17 at 22:07
  • @SeanKramer That's your choice (of course)... Hopefully this still proves useful :-) – piRSquared Apr 25 '17 at 22:09
  • Certainly useful! The repeat and assign functions are both things I haven't come across before that I probably could have used numerous times already. – Sean Kramer Apr 25 '17 at 22:12
1

Try an outer merge

pd.merge(items, forecast, on=['Item', 'Group'], how='outer')

From here

If not, try this page for more inspiration: http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging

Community
  • 1
  • 1
Charlie G
  • 534
  • 5
  • 16
  • This doesn't address creating a sensible "Month" column in `items` that includes **all** possible Item/Group/Month combinations. That's the question here, not how to merge once that's done. – Sean Kramer Apr 25 '17 at 20:53
  • Months should be added in the merge if I'm doing it correctly...I think. I updated my answer to remove the `'Month'` key. Go to the outer merge in the link and see if that will do what you want. – Charlie G Apr 25 '17 at 21:00
1

Starting from the following dataframe, with columns Item and Group not set as Index, I have the following:

   Item Group
0  0001     A
1  0002     A
2  0003     B
3  0004     A

items_mod = pd.DataFrame()
for i in [5, 6, 7]:
    items['Month'] = i
    items_mod = items_mod.append(items)
items_mod = items_mod.sort_values('Item')

Which gives me the following dataframe:

   Item Group  Month
0  0001     A      5
0  0001     A      6
0  0001     A      7
1  0002     A      5
1  0002     A      6
1  0002     A      7
2  0003     B      5
2  0003     B      6
2  0003     B      7
3  0004     A      5
3  0004     A      6
3  0004     A      7
Eric B
  • 1,635
  • 4
  • 13
  • 27
  • This is basically what I ended up with. The only issue is that it doesn't preserve the original ordering of the items (they're not actually in sorted order in the real DF). To address that, I just kept the index as a column and sorted on that afterward. Not sure why I thought it was more complicated than this. – Sean Kramer Apr 25 '17 at 21:54