I've been working on a recommender system, and so far we've built a simple neighborhood-based algorithm for an e-commerce store.
The system was built in Python to make extensive use of its mathematical libraries, and it works in a very straightforward way: it receives as input a list of the products the customer navigated, and the output is a list of the top 10 recommended items.
To do so we implemented two Python scripts: one is the web service, in which we use Tornado to receive the URLs, and the second is the Neighborhood application itself, which holds our similarity matrices and computes the recommendations. They communicate with each other through Inter-Process Communication (IPC), using the built-in multiprocessing library.
We created our web service like so:
import sys
sys.path.append("..")

import tornado.httpserver
import tornado.ioloop
import tornado.web
from tornado.escape import json_encode

from shared.bootstrap import *
import argparse
from clients import ClientFactory, ClientNotFoundException


class WService(tornado.web.RequestHandler):

    _clients = {}

    def get(self, algorithm=None):
        algorithm = 'neighborhood' if not algorithm else algorithm
        rec_list = []
        if algorithm == 'favicon.ico':
            algorithm = 'neighborhood'
        print "value of algorithm %s" % (algorithm)
        try:
            if not algorithm in self._clients:
                self._clients[algorithm] = ClientFactory.get_instance(algorithm)
            arguments = self.get_arguments_by_client(self._clients[algorithm].get_expected_arguments())
            rec_list = self._clients[algorithm].call(arguments)
        except ClientNotFoundException as err:
            error("Error: " + str(err))
        except Exception as err:
            error("Error: " + str(err))
            # recreate the client after an unexpected failure
            self._clients[algorithm] = ClientFactory.get_instance(algorithm)
        rec_dict = {"skus": [str(sku) for sku in rec_list]}
        self.write(json_encode(rec_dict))

    def get_arguments_by_client(self, expected_arguments):
        arguments = {}
        for key in expected_arguments:
            arguments[key] = self.get_argument(key, expected_arguments[key])
        return arguments


application = tornado.web.Application([
    (r"/(.*)", WService),
])


def parse_command_line_params():
    parser = argparse.ArgumentParser()
    parser.add_argument('--port', help="Service running port number", required=True)
    return parser.parse_args()


if __name__ == "__main__":
    http_server = tornado.httpserver.HTTPServer(application)
    cmd_args = parse_command_line_params()
    http_server.listen(cmd_args.port)
    tornado.ioloop.IOLoop.instance().start()
Basically, when Tornado receives a GET request, our ClientFactory starts our Neighborhood application if it's not up yet and sets up the IPC (to do so we followed this thread on SO, second answer).
So when Tornado receives a URL, it parses it and sends the list of items through IPC to our Neighborhood application; this in turn processes the information and sends the result back through the same IPC to Tornado, which finally outputs the chosen products in JSON format.
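For context, the request/response cycle over the IPC looks roughly like this. This is a minimal sketch, not our actual code: `neighborhood_worker` and the `"REC-"` placeholder computation are made up for illustration; the real worker holds the similarity matrices.

```python
from multiprocessing import Process, Pipe

def neighborhood_worker(conn):
    """Stand-in for the Neighborhood process: receives a list of SKUs
    over the pipe and sends back a (placeholder) recommendation list."""
    while True:
        skus = conn.recv()            # blocks until the web service sends a request
        if skus is None:              # sentinel value shuts the worker down
            break
        recs = ["REC-" + sku for sku in skus]   # placeholder for the real scoring
        conn.send(recs)

if __name__ == "__main__":
    parent_conn, child_conn = Pipe()
    worker = Process(target=neighborhood_worker, args=(child_conn,))
    worker.start()

    # On each GET request, the Tornado handler effectively does:
    parent_conn.send(["PR840ACG60NNV", "BO185SHF79ZRG"])
    print(parent_conn.recv())         # ['REC-PR840ACG60NNV', 'REC-BO185SHF79ZRG']

    parent_conn.send(None)            # stop the worker
    worker.join()
```

Note that `conn.recv()` and `parent_conn.recv()` are both blocking calls, so each request is handled strictly one at a time.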
As an example here's an URL sent to Tornado:
http://localhost:8000/?skus_navigated=PR840ACG60NNV,BO185SHF79ZRG,BO185SHF99OBK&skus_carted=&skus_purchased=AN658APF41AIC&category=49
Each parameter carries at most 10 products. If they are all null or unrecognizable, the default response is a null JSON.
Our Neighborhood algorithm holds a sparse SciPy matrix whose shape is currently around (300k, 300k). If we send, for instance, the parameter:
skus_navigated = PR840ACG60NNV,BO185SHF79ZRG,BO185SHF99OBK
The SKUs PR840ACG60NNV, BO185SHF79ZRG and BO185SHF99OBK
are internally mapped (if they were observed in the data) and receive a given score (say, 1.0), so if their mapping is something like:
PR840ACG60NNV = 7
BO185SHF79ZRG = 5000
BO185SHF99OBK = 300
Then we create a vector v
such that:
v = zeros(300000)  # i.e. the 300k dimension of the matrix
v[[7, 300, 5000]] = 1.0
And then we multiply v
by our SciPy matrix to get the top recommendations.
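Concretely, the scoring step looks roughly like this. This is a sketch with a toy 6x6 similarity matrix standing in for the real ~300k x 300k one; the similarity values and the `observed` indices are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy symmetric similarity matrix; in production this is the precomputed
# (300k, 300k) sparse matrix.
sim = csr_matrix(np.array([
    [0.0, 0.2, 0.0, 0.5, 0.0, 0.1],
    [0.2, 0.0, 0.3, 0.0, 0.0, 0.0],
    [0.0, 0.3, 0.0, 0.0, 0.4, 0.0],
    [0.5, 0.0, 0.0, 0.0, 0.0, 0.2],
    [0.0, 0.0, 0.4, 0.0, 0.0, 0.6],
    [0.1, 0.0, 0.0, 0.2, 0.6, 0.0],
]))

# Internal indices of the SKUs seen in the request (in the question's
# example these would be 7, 300 and 5000).
observed = [1, 4]

v = np.zeros(sim.shape[0])
v[observed] = 1.0                     # score 1.0 for each navigated SKU

scores = sim.dot(v)                   # one sparse matrix-vector product
scores[observed] = 0.0                # don't recommend items already seen
top = np.argsort(scores)[::-1][:10]   # indices of the top recommendations
```

The matrix-vector product itself is cheap even at (300k, 300k) as long as the matrix is sparse, which matches the ~10 ms per-request time we observe.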
After that we tried to go live with the system. It worked for a few minutes, but then the response time started to increase considerably, to the point where we had to shut the system down.
Even though the whole pipeline, from parsing the URL to outputting the JSON result, takes only 10 ms on average, something happens that makes the system take longer and longer to respond until it breaks down.
There seems to be no problem with our infrastructure, and we are using virtual machines that can handle our volume of requests.
So I'd like to ask: is there a way to use Tornado for this IPC connection to our Neighborhood application, perhaps in some asynchronous manner, to avoid requests queueing up and eventually breaking the system down?
Is there a better way to build this system between Tornado and the Neighborhood application?
We thought that loading our Neighborhood matrices inside Tornado would make the system somewhat unstable, because if one broke down, so would the other. But the solution we created so far seems to have made the system even less stable, even though it's considerably fast at processing requests.
I appreciate your help and if you need more information please let me know.
Thanks in advance,