API design for machine learning models

Question

Using Django rest framework to build an API webservice that contains many of already trained machine learning models. Some models can predict a batch_size of 1 or an image at a time. Others need a history of data (timelines) to be able to predict/forecasts. Usually these timelines can hardly fit and passed as parameter. Being that, we want to give the requester the ability to request by either:

sending the data (small batches) to predict as parameter.
passing a database id/reference as parameter then the API will query the database and do the predictions.

So the question is, what would be the best API design for identifying which approach the requester chose?. Some considered approaches:

Add /db to the path of the endpoint ex: POST models/<X>/db. The problem with this approach is that (2x) endpoints are generated for each model.
Add parameter db as boolean to each request. The problem with such approach is that it adds additional overhead for each request just to check which approach. Also, make the code less readable.
Global variable set for each requester when signed for the API token. The problem is that you restricted the requester for 1 mode which is not convenient.

What would be the best approach for this case

"Usually these timelines can hardly fit and passed as parameter" -- However large your data may be, why can't you use POST for all prediction tasks? Is your data in the order of GBs? I believe that browsers allow a couple of GBs at max ([relevant link](https://stackoverflow.com/questions/2880722/is-http-post-limitless)) — doodhwala, Aug 03 '18 at 20:30
Regarding the second method, what use case are you supporting that the requester has the internal db identifier? I would expect the user to be providing all their data at the time of prediction. — doodhwala, Aug 03 '18 at 20:32

score 0 · Answer 1 · answered Aug 07 '18 at 16:25

The fact that you currently have more than one source would cause me to seriously consider attempting to abstract the "source" component as much as possible, to allow all manner of sources. For example, suppose that future users would like to pull data out of a mongodb, instead of a whatever db you currently are using? Or from some other storage structure? Or pull from a third party? Or, or, or....

In any case the question is now "how much do they all have in common, and what should they all implement?"

class Source(object):
    def __get_batch__(self, batch_size=1):
        raise NotImplementedError() #each source needs to implement this on its own

@http_library.POST_endpoint("/db")
class DBSource(Source):
    def __init__(self, post_data):
        if post_data["table"] in ["data1", "data2"]:
            self.table = table
        else:
            raise Exception("Must use predefined table to prevent SQL injection")
    def __get_batch__(self, batch_size=1):
        return sql_library.query("SELECT * FROM {} LIMIT ?".format(self.table), batch_size)

@http_library.POST_endpoint("/local")
class LocalSource(Source):
    def __init__(self, post_data):
        self.data = post_data["data"]
    def __get_batch__(self, batch_size=1):
        data = self.data[self.i, self.i+batch_size]
        i += batch_size
        return data

This is just an example. However, if a fixed part of your path designates "the source", then you have left yourself open to scale this indefinitely.

score 0 · Answer 2 · answered Aug 08 '18 at 06:06

Add /db to the path of the endpoint ex: POST models//db. The problem with this approach is that (2x) endpoints are generated for each model.

Inevitable. DRY out common code to sub-methods.

Add parameter db as boolean to each request. The problem with such approach is that it adds additional overhead for each request just to check which approach. Also, make the code less readable.

There won't be any additional overhead (that's what your underlying framework does to match a URL to a function/method anyway). However, these are 2 separate functionalities, I would keep them separate, so I would prefer the first approach.

Global variable set for each requester when signed for the API token. The problem is that you restricted the requester for 1 mode which is not convenient.

Yikes! unless you provide a UI letting a user to select his preference and apply it globally (I don't think any UX will agree to that)

That being said, the api design should be driven by questioning who is mastering (or owning) the data. If it's the application and user already knows the ID of that entity, then you shouldn't be asking the data from the user.

If it's the user, and then if it won't fit in a POST body, then I would say, a real-time API may not be the right solution, think about message queues/pub-sub based systems.

If you need a hybrid solution as you asked in the question, then, I would prefer the 1st approach.

API design for machine learning models

2 Answers2