1

Assume this is my dataframe. It's a sale data.

        date  date_block_num  shop_id  item_id  item_price  item_cnt_day
0 2013-01-02               0       59    22154      999.00          1.00
1 2013-01-03               0       25     2552      899.00          1.00
2 2013-01-05               1       25     2552      899.00         -1.00
3 2013-01-06               2       25     2554     1709.05          1.00
4 2013-01-15               2       28     2555     1099.00          1.00
5 2013-01-10               3       25     2564      349.00          1.00
6 2013-01-02               3       26     2565      549.00          1.00
7 2013-01-04               3       25     2572      239.00          1.00
8 2013-01-11               4       25     2572      299.00          1.00
9 2013-01-03               4       27     2573      299.00          3.00

So I'm trying to get all combinations or pairs of shop_id, item_id with respect to date_block_num column like my code below.

matrix = []

for i in range(5):
    sale = sales[sales.date_block_num==i]
    matrix.append(np.array(list(itertools.product([i], sales.shop_id.unique(), sales.item_id.unique())), dtype='int16'))

df = pd.DataFrame(np.vstack(matrix)) #This works but it's slow.

Any help on writing this same code without loops.

I tried to do something like this but it's too slow and return memory error when I turn it into dataframe on my original dataset.

from itertools import product
df = pd.DataFrame(list(product(sales.date_block_num.unique(), sales.shop_id.unique(), sales.item_id.unique())))

Note: Original dataset have more than million rows.

Jeeth
  • 2,226
  • 5
  • 24
  • 60
  • Side note related to the question, `itertools` is not vectorized. You need to be calling `numpy` methods to have what we generally mean when saying "vectorized" – roganjosh Nov 20 '18 at 21:10
  • "Original dataset have more than million rows" ... how big/small do you think your result is going to be? Do you know how many combinations you could have from 3 columns 1M elements in length? This question makes no sense to me. – cs95 Nov 20 '18 at 21:34
  • I'm marking as duplicate of https://stackoverflow.com/questions/11144513/numpy-cartesian-product-of-x-and-y-array-points-into-single-array-of-2d-points which has vectorised solutions. If that doesn't work, you will need to think of something smarter. – cs95 Nov 20 '18 at 21:36
  • @coldspeed Well for 2.8 million rows I've got `44484240 rows` as the result. Yes this is what I'am looking for. I tried the code you linked but it didn't work for me. Can you tell me why the downvote? – Jeeth Nov 20 '18 at 21:39
  • Okay, try the duplicate link which has some explosively fast solutions (explosive in terms of memory). By the way, the downvote is not a personal attack on you, it is just that your question is one that has been asked before and has many known solutions. – cs95 Nov 20 '18 at 21:42

1 Answers1

2

You can try the Pandas implementation for the Cartesian product:

cartesian = pd.core.reshape.util.cartesian_product

cart_cols = ['date_block_num', 'shop_id', 'item_id']
date, shop, item = cartesian([df[col].unique() for col in cart_cols])

res = pd.DataFrame({'date_block_num': date, 'shop': shop,'item': item})
jpp
  • 159,742
  • 34
  • 281
  • 339