2

I've been working on a recommender system for an e-commerce store; so far we've built a single simple algorithm, the neighborhood approach.

The system was built in Python to make extensive use of its mathematical libraries, and it works in a straightforward way: it receives as input a list of the products the customer navigated, and its output is a list of the top 10 recommended items.

To do so we implemented two Python scripts: one is the web service, in which we use Tornado to receive the URLs, and the other is the Neighborhood engine itself, which holds our similarity matrices and computes the recommendations. They communicate via Inter-Process Communication (IPC), using the built-in multiprocessing library.

We created our web service like so:

import sys
sys.path.append("..")

import tornado.httpserver
import tornado.ioloop
import tornado.web
from tornado.escape import json_encode
from shared.bootstrap import *

import argparse

from clients import ClientFactory, ClientNotFoundException

class WService(tornado.web.RequestHandler):

    _clients = {}

    def get(self, algorithm = None):

        algorithm = 'neighborhood' if not algorithm else algorithm
        rec_list = []

        if algorithm == 'favicon.ico':
            algorithm = 'neighborhood'

        print("value of algorithm %s" % algorithm)

        try:

            if algorithm not in self._clients:
                self._clients[algorithm] = ClientFactory.get_instance(algorithm)

            arguments = self.get_arguments_by_client(self._clients[algorithm].get_expected_arguments())

            rec_list = self._clients[algorithm].call(arguments)

        except ClientNotFoundException as err:
            error("Error: " + str(err))

        except Exception as err:
            error("Error: " + str(err))
            self._clients[algorithm] = ClientFactory.get_instance(algorithm)

        rec_dict = {"skus" : [str(sku) for sku in rec_list]}
        self.write(json_encode(rec_dict))

    def get_arguments_by_client(self, expected_arguments):
        arguments = {}
        for key in expected_arguments:
            arguments[key] = self.get_argument(key, expected_arguments[key])

        return arguments

application = tornado.web.Application([
                                       (r"/(.*)", WService),
                                       ])


def parse_command_line_params():

    parser = argparse.ArgumentParser()

    parser.add_argument('--port', type=int, help="Service running port number", required=True)
    return parser.parse_args()

if __name__ == "__main__":

    http_server = tornado.httpserver.HTTPServer(application)
    cmd_args = parse_command_line_params()
    http_server.listen(cmd_args.port)
    tornado.ioloop.IOLoop.instance().start()

Basically, when Tornado receives a GET request, our ClientFactory starts our Neighborhood application if it is not up yet and sets up the IPC (to do so we followed this thread on SO, second answer).

So when Tornado receives a URL, it parses it and sends the list of items through IPC to our Neighborhood application; this in turn processes the information and sends the result back through the same IPC channel to Tornado, which finally outputs the chosen products in JSON format.
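To make the flow concrete, here is a minimal sketch of the Pipe-based round trip between the two processes (the worker body is a placeholder; the real one runs our similarity computation, and all names here are illustrative):

```python
from multiprocessing import Process, Pipe

def neighborhood_worker(conn):
    # Hypothetical worker loop: receives a list of skus over the pipe,
    # sends back a list of recommendations. A None message shuts it down.
    while True:
        skus = conn.recv()
        if skus is None:
            break
        # Placeholder for the real similarity computation: here we just
        # echo each sku reversed to stand in for a recommendation list.
        conn.send([sku[::-1] for sku in skus])

parent_conn, child_conn = Pipe()
worker = Process(target=neighborhood_worker, args=(child_conn,))
worker.start()

# One request/response cycle, as the Tornado handler would do it
parent_conn.send(["PR840ACG60NNV", "BO185SHF79ZRG"])
result = parent_conn.recv()

parent_conn.send(None)  # stop the worker
worker.join()
```

Note that `conn.recv()` on the Tornado side blocks until the worker answers, which matters for the problem described below.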

As an example, here's a URL sent to Tornado:

http://localhost:8000/?skus_navigated=PR840ACG60NNV,BO185SHF79ZRG,BO185SHF99OBK&skus_carted=&skus_purchased=AN658APF41AIC&category=49

Each parameter carries at most 10 products. If they all end up null or unrecognizable, the default response is an empty JSON object.

Our Neighborhood algorithm holds a sparse SciPy matrix whose shape is currently around (300k, 300k). If we send, for instance, the parameter:

skus_navigated = PR840ACG60NNV,BO185SHF79ZRG,BO185SHF99OBK

The skus PR840ACG60NNV, BO185SHF79ZRG and BO185SHF99OBK are internally mapped (if they were observed in the data) and receive a given score (let's say 1.0), so if their mapping is something like:

PR840ACG60NNV = 7
BO185SHF79ZRG = 5000
BO185SHF99OBK = 300

Then we create a vector v such as:

v = zeros(300000)
v[[7, 300, 5000]] = 1.0

We then multiply v by our SciPy matrix to get the top recommendations.
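The scoring step looks roughly like this (a small random matrix stands in for our real ~300k × 300k similarity matrix, and the orientation of the multiplication is illustrative):

```python
import numpy as np
from scipy.sparse import random as sparse_random

n_items = 1000  # stand-in for the real ~300k item catalogue
similarity = sparse_random(n_items, n_items, density=0.01,
                           format="csr", random_state=42)

# Score vector: 1.0 at the internal ids of the navigated skus
v = np.zeros(n_items)
v[[7, 300, 500]] = 1.0

# v times the similarity matrix gives one score per catalogue item;
# similarity.T.dot(v) is the same product written for a CSR matrix
scores = similarity.T.dot(v)

# Internal ids of the top 10 recommendations, best first
top10 = np.argsort(scores)[::-1][:10]
```

The product itself is cheap because the matrix is sparse and v has only a handful of non-zeros, which is consistent with the ~10 ms per request we measure.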

After that we tried to go live with the system. It worked for a few minutes, but then the response time started to increase considerably, to the point where we had to turn the system off.

Even though the system takes only 10 ms on average from parsing the URL to outputting the JSON result, at some point it starts to take longer and longer to respond until it breaks down.

There seems to be no problem in our infrastructure, and the virtual machines we use can handle our volume of requests.

So I'd like to ask whether there is a way to use Tornado to perform this IPC and connect to our Neighborhood process, perhaps in some asynchronous manner, to avoid requests queueing up and the eventual breakdown.
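For instance, would offloading the blocking IPC round trip to a thread pool, roughly like this, be a reasonable direction? (All names are illustrative; `blocking_call` stands in for the `Pipe` send/receive cycle.)

```python
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)

def blocking_call(skus):
    # Placeholder for the blocking IPC round trip to the
    # Neighborhood process (send skus, wait for the answer)
    return ["REC_" + sku for sku in skus]

# Inside a Tornado coroutine handler this would become something like:
#     result = await IOLoop.current().run_in_executor(
#         executor, blocking_call, skus)
# so the IOLoop is free to accept other requests while one is computed.
future = executor.submit(blocking_call, ["PR840ACG60NNV"])
result = future.result()
```

Our worry is whether this actually helps when all requests ultimately funnel through a single Neighborhood process on the other end of the pipe.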

Is there a better way of building this system between Tornado and the Neighborhood process?

We thought that loading our Neighborhood matrices inside Tornado would make the system somewhat unstable, because if one component broke down so would the other. But the solution we created so far seems to have made the system even less stable, even though it processes individual requests quickly.

I appreciate your help and if you need more information please let me know.

Thanks in advance,

Willian Fuks
  • I think the basic issue is that you're running an expensive algorithm on demand. Can't you precalculate values? Where does the list of items come from - a database? Do `GET`s represent a single item in the list? – loopbackbee Dec 27 '13 at 14:01
  • @goncalopp The list of values come from URLs obtained from cookies from each user. The `GET` might take several products as input up to 10 in each parameter (I've edited the question with this info). – Willian Fuks Dec 27 '13 at 14:17
  • Does the algorithm consider only the skus sent by the `GET` to calculate the recommendations, or is there an internal database that is queried to get others? – loopbackbee Dec 27 '13 at 15:28
  • @goncalopp Internally we have a SciPy matrix where each row represents a different sku, so that when we process the algebraic operations the result is a list with all skus we observed in the data used to build the neighborhood matrix. – Willian Fuks Dec 27 '13 at 16:19
  • what I meant to ask is if those rows represent *only* the data coming from the GET - or is there another source of information? If your `GET` sends you 10 items, your matrix will have 10 rows/columns? Also, do you need the recommendation right away (on the `GET` response), or can you query the server later? – loopbackbee Dec 27 '13 at 16:23
  • @goncalopp Our matrix is currently (300k,300k) and it's built by observing some past behavior from users. If 4 items are sent, then we create a vector with 300k null values and 4 non-null values in it and multiply both to get the top recommendations. I'll add this info to the post – Willian Fuks Dec 27 '13 at 16:27
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/44002/discussion-between-goncalopp-and-will) – loopbackbee Dec 27 '13 at 16:30
