
I'm currently trying to predict the next sequence of goods a customer is likely to buy in the next time period. The following example is for illustrative purposes (my actual dataset has around 6 million customer IDs and 5000 different products).

My current data looks like the following:

date  customer_nbr  products_bought
201701  123 ["product_1","product_5","product_15"]
201704  123 ["product_4","product_10","product_11"]
201721  123 ["product_1","product_6"]
201713  456 ["product_7","product_11","product_12","product_15"]
201714  456 ["product_1","product_3"]
201721  456 ["product_4","product_9","product_10","product_13","product_15"]

where the frequency of the data is weekly. So customer 123 bought "product_1", "product_5" and "product_15" in the first week of 2017 (there are up to 52 weeks in a given year). After lagging to get my input variable, my final dataframe looks like:

date  customer_nbr  products_bought_last_period   products_bought
201704  123 ["product_1","product_5","product_15"]  ["product_4","product_10","product_11"]
201721  123 ["product_4","product_10","product_11"]  ["product_1","product_6"]
201714  456 ["product_7","product_11","product_12","product_15"]   ["product_1","product_3"]
201721  456 ["product_1","product_3"]  ["product_4","product_9","product_10","product_13","product_15"]

Thus, for my seq2seq model I want to predict the sequence of products bought on date 201721 by each customer using products_bought_last_period; products_bought_last_period is my input and products_bought is my target variable. I then encoded my product ids and padded the products_bought_last_period and products_bought arrays in my dataframe (based on the array with the most products). Afterwards, I converted everything into np.arrays.
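The encoding and padding step was essentially the following (a simplified sketch; all_product_ids and product_to_idx are stand-ins for the mapping I actually built):

from keras.preprocessing.sequence import pad_sequences

# map each product id to an integer index (0 is reserved for padding)
product_to_idx = {p: i + 1 for i, p in enumerate(all_product_ids)}

def encode(basket):
    return [product_to_idx[p] for p in basket]

df['products_bought_last_period'] = df['products_bought_last_period'].apply(encode)
df['products_bought'] = df['products_bought'].apply(encode)

# pad every basket to the length of the longest basket
max_len = max(df['products_bought_last_period'].apply(len).max(),
              df['products_bought'].apply(len).max())
df['products_bought_last_period'] = list(pad_sequences(df['products_bought_last_period'].tolist(), maxlen=max_len, padding='post'))
df['products_bought'] = list(pad_sequences(df['products_bought'].tolist(), maxlen=max_len, padding='post'))

Lastly, the total number of products in my actual dataset is 5000, so I set total_nbr_of_products = 5000 and tried the following: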

import numpy as np
from keras.layers import Input, LSTM, Dense
from keras.models import Model

train = df[df['date'] < 201721].set_index('date')
test = df[df['date'] >= 201721].set_index('date')

X = train['products_bought_last_period'].copy()
X_test = test['products_bought_last_period'].copy()
y = train['products_bought'].copy()
y_test = test['products_bought'].copy()

X = np.array(X)
X_test = np.array(X_test)
y = np.array(y)
y_test = np.array(y_test)

# Encoder model
total_nbr_of_products = 5000
encoder_input = Input(shape=(None, total_nbr_of_products))
encoder_LSTM = LSTM(256, return_state=True)
encoder_outputs, encoder_h, encoder_c = encoder_LSTM(encoder_input)
encoder_states = [encoder_h, encoder_c]

# Decoder model
decoder_input = Input(shape=(None, total_nbr_of_products))
decoder_LSTM = LSTM(256, return_sequences=True, return_state=True)
decoder_out, _, _ = decoder_LSTM(decoder_input, initial_state=encoder_states)
decoder_dense = Dense(total_nbr_of_products, activation='softmax')
decoder_out = decoder_dense(decoder_out)


model = Model(inputs=[encoder_input, decoder_input],outputs=[decoder_out])

model.compile(optimizer='adam', loss='categorical_crossentropy')
model.fit(X, y,
          validation_data=(X_test, y_test),
          batch_size=64,
          epochs=5)

However, when I tried that, I got the following error:

ValueError: Error when checking model input: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 2 array(s), but instead got the following list of 1 arrays: [array([[array([1209, 2280, 1066, 3308, 3057, 4277, 3000, 4090,    0,    0,    0,

I'm not sure about two main things:

1.) what I might be doing wrong as far as matching up my dimensions

2.) whether my seq2seq approach is correct to begin with

Ideally, I'm looking to predict the next basket of goods a customer (for around 6 million customers) is likely to buy. I'd greatly appreciate any assistance.

M3105
  • In case you are still interested in this: I suggest looking at https://github.com/HaojiHu/Sets2Sets – tandem Jun 26 '20 at 16:08

1 Answer

  1. What might I be doing wrong as far as matching up my dimensions?

See how your model is defined.

model = Model(inputs=[encoder_input, decoder_input], outputs=[decoder_out])

Your model takes two inputs ([encoder_input, decoder_input]) and produces one output (decoder_out), so you need to feed it both inputs when fitting. Your model.fit() should look as follows:

model.fit([train_encoder_input, train_decoder_input], train_decoder_output)
  2. Is seq2seq correct here?

To me this seems an unconventional use of seq2seq, but fine. You will have to check whether lagging by one period is the optimal choice, and you will have to one-hot encode your lists of products purchased.
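
Strictly speaking, a basket of several products becomes a multi-hot vector rather than a one-hot one. A minimal sketch of that encoding (basket_to_multi_hot is a hypothetical helper; total_nbr_of_products comes from your question):

import numpy as np

total_nbr_of_products = 5000

def basket_to_multi_hot(basket):
    # basket is a list of integer product indices, e.g. [1209, 2280, 1066]
    vec = np.zeros(total_nbr_of_products, dtype='float32')
    vec[basket] = 1.0
    return vec

print(basket_to_multi_hot([1209, 2280, 1066]).sum())  # 3.0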

EDIT: Added a simple example below.

There are a couple of excellent examples explained in the links below; refer to them for further details on seq2seq with Keras.

https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html

https://github.com/philipperemy/keras-seq2seq-example


For illustrative purposes, I have written a small example. Let us consider the case where we want to transform one string into another. Say we are introducing a new postal code system.

import numpy as np
from keras.layers import Input, LSTM, Dense
from keras.models import Model

df = {'old': ['ABCDBA', 'EFFEBA', 'CDDCAA', 'BBBBAA', 'FFDDCD', 'DCFEEF', 
              'AAFFBA'],
      'new': ['XYX', 'ZZX', 'YYX', 'XXX', 'ZYY', 'YZZ', 'XZX']}

For convenience, I have fixed the number of tokens and the sequence lengths. We set a beginning character ('M') and an ending character ('N') for the data fed into the decoder.

encoder_texts = [[char for char in word] for word in df['old']]
decoder_texts = [[char for char in word] for word in df['new']]

old_char = ['A', 'B', 'C', 'D', 'E', 'F']
new_char = ['M', 'N', 'X', 'Y', 'Z']

encoder_seq_length = 6
decoder_seq_length = 4
num_encoder_tokens = len(old_char)
num_decoder_tokens = len(new_char)

old_token_index = dict((c, i) for i, c in enumerate(old_char))
new_token_index = dict((c, i) for i, c in enumerate(new_char))

Take the example of 'XYX'. As input to the decoder it becomes 'MXYX', and as output from the decoder it becomes 'XYXN'. Eventually we have to one-hot encode these sequences of characters anyway, so I do it all at once as follows:

encoder_input_data = np.zeros((7, encoder_seq_length, num_encoder_tokens), dtype='float32')
decoder_input_data = np.zeros((7, decoder_seq_length, num_decoder_tokens), dtype='float32')
decoder_output_data = np.zeros((7, decoder_seq_length, num_decoder_tokens), dtype='float32')

for i, (encoder_text, decoder_text) in enumerate(zip(encoder_texts, decoder_texts)):
    for t, char in enumerate(encoder_text):
        encoder_input_data[i, t, old_token_index[char]] = 1
    for t, char in enumerate(decoder_text):
        # decoder input is shifted right by one to make room for the start token
        decoder_input_data[i, t + 1, new_token_index[char]] = 1
        # decoder output is the unshifted target sequence
        decoder_output_data[i, t, new_token_index[char]] = 1
    # start token for the decoder input, end token for the decoder output
    decoder_input_data[i, 0, new_token_index['M']] = 1
    decoder_output_data[i, decoder_seq_length - 1, new_token_index['N']] = 1
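
You can sanity-check the shifting by decoding the first sample back into characters:

# sanity check: decode the first sample ('XYX') back to characters
new_index_token = dict((i, c) for c, i in new_token_index.items())
print([new_index_token[np.argmax(v)] for v in decoder_input_data[0]])   # ['M', 'X', 'Y', 'X']
print([new_index_token[np.argmax(v)] for v in decoder_output_data[0]])  # ['X', 'Y', 'X', 'N']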

Then, you can proceed with your code.

encoder_input = Input(shape=(None, num_encoder_tokens))
encoder_LSTM = LSTM(units=128, return_state=True)
encoder_output, encoder_h, encoder_c = encoder_LSTM(encoder_input)
encoder_states = [encoder_h, encoder_c]

decoder_input = Input(shape=(None, num_decoder_tokens))
decoder_LSTM = LSTM(128, return_sequences=True, return_state=True)
decoder_output, _, _ = decoder_LSTM(decoder_input, initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_output = decoder_dense(decoder_output)

model = Model(inputs=[encoder_input, decoder_input], outputs=[decoder_output])
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit([encoder_input_data, decoder_input_data], decoder_output_data)
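
Note that model.fit() only trains the network; to actually generate a new sequence at prediction time you need the usual two-model inference setup. A sketch following the Keras blog post linked above (decode_sequence is a helper I am introducing here):

# inference: separate encoder and decoder models that reuse the trained layers
encoder_model = Model(encoder_input, encoder_states)

decoder_state_input_h = Input(shape=(128,))
decoder_state_input_c = Input(shape=(128,))
decoder_states_input = [decoder_state_input_h, decoder_state_input_c]
dec_out, dec_h, dec_c = decoder_LSTM(decoder_input, initial_state=decoder_states_input)
dec_out = decoder_dense(dec_out)
decoder_model = Model([decoder_input] + decoder_states_input,
                      [dec_out, dec_h, dec_c])

def decode_sequence(input_seq):
    # encode the input sequence into the initial decoder state
    states = encoder_model.predict(input_seq)
    # start with the 'M' token and generate until 'N' or max length
    target = np.zeros((1, 1, num_decoder_tokens))
    target[0, 0, new_token_index['M']] = 1
    decoded = ''
    while True:
        out, h, c = decoder_model.predict([target] + states)
        idx = np.argmax(out[0, -1, :])
        char = new_char[idx]
        if char == 'N' or len(decoded) >= decoder_seq_length:
            break
        decoded += char
        # feed the predicted character back in as the next decoder input
        target = np.zeros((1, 1, num_decoder_tokens))
        target[0, 0, idx] = 1
        states = [h, c]
    return decoded

print(decode_sequence(encoder_input_data[:1]))  # ideally 'XYX' after training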

To answer your second question a little more: you can use your lagged series of products purchased for your decoder input and output. I do not have a theoretical basis for this, but two consecutive sequences sharing a state through the seq2seq scheme seems okay (at least worth a go).

Kidae Kim
  • @kidae_Kim thank you for your response. Apologies if this is a dumb question, but how would I define my encoder/decoder so that I can incorporate [train_encoder_input, train_decoder_input], train_decoder_output into the model.fit()? – M3105 Nov 03 '19 at 01:45
  • @M3105, no problem. I have edited with a small example. Look through the examples linked, too. – Kidae Kim Nov 04 '19 at 05:06