I'm currently trying to predict the next sequence of goods a customer is likely to buy in the next time period. The following example is for illustrative purposes (my actual dataset has around 6 million customer IDs and 5000 different products).
My current data looks like the following:
date customer_nbr products_bought
201701 123 ["product_1","product_5","product_15"]
201704 123 ["product_4","product_10","product_11"]
201721 123 ["product_1","product_6"]
201713 456 ["product_7","sproduct_11","product_12","product_15"]
201714 456 ["product_1","product_3"]
201721 456 ["product_4","product_9","product_10","product_13","product_15"]
The data is weekly, so customer_nbr 123 bought items "product_1", "product_5" and "product_15" in the first week of 2017 (there are up to 52 weeks in a given year). After lagging to get my input variable, my final dataframe looks like:
date customer_nbr products_bought_last_period products_bought
201704 123 ["product_1","product_5","product_15"] ["product_4","product_10","product_11"]
201721 123 ["product_4","product_10","product_11"] ["product_1","product_6"]
201714 456 ["product_7","sproduct_11","product_12","product_15"] ["product_1","product_3"]
201721 456 ["product_1","product_3"] ["product_4","product_9","product_10","product_13","product_15"]
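For reference, the lagging step described above can be sketched roughly as follows. This is a minimal pandas sketch, assuming the raw data sits in a dataframe with the three columns shown earlier; the names `raw` and `lagged` are illustrative only. Note that `shift(1)` takes each customer's previous recorded basket, which is not necessarily the previous calendar week (this matches the example rows above).

```python
import pandas as pd

# Toy data copied from the example above; `raw` and `lagged` are illustrative names.
raw = pd.DataFrame({
    "date": [201701, 201704, 201721, 201713, 201714, 201721],
    "customer_nbr": [123, 123, 123, 456, 456, 456],
    "products_bought": [
        ["product_1", "product_5", "product_15"],
        ["product_4", "product_10", "product_11"],
        ["product_1", "product_6"],
        ["product_7", "sproduct_11", "product_12", "product_15"],
        ["product_1", "product_3"],
        ["product_4", "product_9", "product_10", "product_13", "product_15"],
    ],
})

# Sort so that each customer's rows are in chronological order.
raw = raw.sort_values(["customer_nbr", "date"])

# Shift within each customer: the previous basket becomes the input.
raw["products_bought_last_period"] = (
    raw.groupby("customer_nbr")["products_bought"].shift(1)
)

# The first row per customer has no previous basket, so drop it.
lagged = raw.dropna(subset=["products_bought_last_period"]).reset_index(drop=True)
lagged = lagged[
    ["date", "customer_nbr", "products_bought_last_period", "products_bought"]
]
print(lagged)
```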
Thus, for my seq2seq model I want to predict the sequence of products bought by each customer for date 201721 using products_bought_last_period. In other words, products_bought_last_period is my input and products_bought is my target variable.
I then encoded my product IDs and padded the products_bought_last_period and products_bought arrays in my dataframe (to the length of the array with the most products). Afterwards, I converted everything to np.arrays. Lastly, the total number of products in my actual dataset is 5000, so I set total_nbr_of_products = 5000 and tried the following:
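# Assumed imports (standalone Keras; use tensorflow.keras if that is what is installed)
import numpy as np
from keras.models import Model
from keras.layers import Input, LSTM, Dense

train = df[df['date'] < 201721].set_index('date')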
test = df[df['date'] >= 201721].set_index('date')
X = train["products_bought_last_period"].copy()
X_test = test["products_bought_last_period"].copy()
y = train['products_bought'].copy()
y_test = test['products_bought'].copy()
X = np.array(X)
X_test = np.array(X_test)
y = np.array(y)
y_test = np.array(y_test)
# Encoder model
total_nbr_of_products = 5000
encoder_input = Input(shape=(None, total_nbr_of_products))
encoder_LSTM = LSTM(256, return_state=True)
encoder_outputs, encoder_h, encoder_c = encoder_LSTM(encoder_input)
encoder_states = [encoder_h, encoder_c]
# Decoder model
decoder_input = Input(shape=(None, total_nbr_of_products))
decoder_LSTM = LSTM(256, return_sequences=True, return_state=True)
decoder_out, _, _ = decoder_LSTM(decoder_input, initial_state=encoder_states)
decoder_dense = Dense(total_nbr_of_products, activation='softmax')
decoder_out = decoder_dense(decoder_out)
model = Model(inputs=[encoder_input, decoder_input], outputs=[decoder_out])
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.fit(X, y,
          validation_data=(X_test, y_test),
          batch_size=64,
          epochs=5)
However, when I ran this I got the following error:
ValueError: Error when checking model input: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 2 array(s), but instead got the following list of 1 arrays: [array([[array([1209, 2280, 1066, 3308, 3057, 4277, 3000, 4090, 0, 0, 0,
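To make the dimension question concrete, here is a minimal NumPy sketch of the shapes that a model with two `Input(shape=(None, total_nbr_of_products))` layers would take: two 3-D one-hot arrays, one for the encoder and one for the decoder. It assumes integer-encoded baskets with 0 used for padding (as the trailing zeros in the error message suggest) and a toy vocabulary of 10 instead of 5000. The helper names `pad_baskets` and `one_hot`, and the choice of building the decoder input by shifting the target one step to the right (teacher forcing), are assumptions for illustration rather than a confirmed fix.

```python
import numpy as np

vocab_size = 10  # toy stand-in for total_nbr_of_products = 5000


def pad_baskets(baskets, max_len):
    """Right-pad integer-encoded baskets with 0 into a 2-D int array."""
    out = np.zeros((len(baskets), max_len), dtype="int64")
    for i, basket in enumerate(baskets):
        out[i, :len(basket)] = basket[:max_len]
    return out


def one_hot(padded, depth):
    """Turn (batch, timesteps) ids into a (batch, timesteps, depth) one-hot array."""
    # Note: padding id 0 is encoded as class 0 here, like any other id.
    return np.eye(depth, dtype="float32")[padded]


# Hypothetical encoded baskets for two customers.
last_period = [[1, 5, 9], [4, 2]]
current = [[4, 7, 8, 3], [1, 6]]
max_len = 4

encoder_in = one_hot(pad_baskets(last_period, max_len), vocab_size)
target = one_hot(pad_baskets(current, max_len), vocab_size)

# Decoder input: the target shifted right by one step; the first step is
# left all-zero as a crude start marker (an assumption, not a requirement).
decoder_in = np.zeros_like(target)
decoder_in[:, 1:, :] = target[:, :-1, :]

print(encoder_in.shape, decoder_in.shape, target.shape)
```

With arrays shaped like these, a two-input Keras model is fitted by passing the inputs as a list, e.g. `model.fit([encoder_in, decoder_in], target, ...)`. Dense one-hot arrays of this kind would be very large at 5000 products and millions of customers, so this is only meant to illustrate the shapes.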
I'm not sure about two main things:
1.) What I might be doing wrong in matching up my dimensions
2.) Whether my seq2seq approach is correct to begin with
Ideally, I'm looking to predict the next basket of goods each customer (around 6 million customers) is likely to buy. I'd greatly appreciate any assistance.