REST API: Does validation on identifiers break encapsulation?

Question

I figured I'd post here to get some ideas/feedback on something I've come up against recently. The API I've developed has validation on an identifier that's passed through as a path parameter: e.g. /resource/resource_identifier

There are some specific business rules as to what makes an idenfier valid and my API has validation which enforces these rules and returns a 400 when that's violated.

Now the reason I'm writing this is that I've been doing this sort of thing in every REST (ish) API I've ever written. It's kind of ingrained in me now but ecently I've been told that this is 'bad' and that it breaks encapsulation. Furthermore, it does this by forcing a consumer to have knowledge about the format of an identifier. I'm told that I should be returning a 404 instead and simply accept anything as an idenfier.

We've had some pretty heated debates about this and what encapsulation actually means in the context of REST. I've found numerous definitions but they aren't specific. As with any REST contention it's hard to substantiate an argument for either.

If StackOverflow would allow me, I'd like to try and gain a concensus on this and why APIs like Spotify for example, use 400 in this scenario.

score 2 · Answer 1 · answered Oct 31 '22 at 15:15

I've been doing this sort of thing in every REST (ish) API I've ever written. It's kind of ingrained in me now but recently I've been told that this is 'bad'

In the context of HTTP, it is an "anti-pattern", yes.

I'm told that I should be returning a 404 instead

And that is the right pattern when you want the advantages of responding like a general purpose web server.

Here's the point: if you want general purpose components in the HTTP application to be able to do sensible things with your response messages, then you need to provide them with the appropriate meta data.

In the case of a target resource identifier that satisfies the request-target production rules defined in RFC 9112 but is otherwise unsatisfactory; you can choose any response semantics you want (400? 403? 404? 499? 200?).

But if you choose 404, then general purpose components will know that the response is an error that can be re-used for other requests (under appropriate conditions - see RFC 9111).

why APIs like Spotify for example, use 400 in this scenario.

Remember: engineering is about trade offs.

The benefits of caching may not outweigh more cost effective request processing, or more efficient incident analysis, or ....

It's also possible that it's just habit - it's done that way because that's the way that they have always done it; or because they were taught it as a "best practice", or whatever. One of the engineering trade offs we need to consider is whether or not to invest in analyzing a trade off!

An imperfect system that ships earns more market share than a perfect solution that doesn't.

Roman Vottner · Answer 2 · 2022-10-31T14:59:04.140

While it may sound natural to expose the resource internal ID as ID used in the URI, remember that the whole URI itself is the identifier of a resource and not only the last bit of the URI. Clients are usually also not interested in the characters that form the URI (or at least they shouldn't care about it) but only in the state they receive upon requesting that from the API/server.

Further, if you think long-term, which should be the reason why you want to build your design on top of a REST architecture, is there a chance that the internal identifier of a resource could ever change? If so, introducing an indirection could make more sense then i.e. by using UUIDs instead of product IDs in the URI and then have a further table/collection to perform a mapping from UUID to domain object ID. Think of a resource that exposes some data of a product. It may sound like a good idea to use the product ID at the end of the URI as they identify the product in your domain model clearly. But what happens if you company undergoes a merge with an other company that happens to have an overlap on product but then uses different identifiers than you? I've seen such cases in reality, unfortunately, and almost all of them wanted to avoid change for their clients and so had to support multiple URIs for the same products in the end.

This is exactly why Mike Amundsen said

... your data model is not your object model is not your resource model ... (Source)

REST is full of such indirection mechanisms to allow such systems to avoid coupling. I.e. besides above mentioned mechanism, you also have link-relations to allow servers to switch URIs when needed while clients can still lookup the URI via the exposed relation name, or its focus on negotiated media types and its representation formats rather than forcing clients to speak their API-specific RPC-like, plain-JSON slang.

Jim Webber further coined the term domain application protocol to describe that HTTP is an application protocol for exchanging documents and any business rules we infer are just side effects of the actual document management performed by HTTP. So all we do in "REST" is basically to send documents back and forth and infer some business logic to act upon receiving certain documents.

In regards to encapsulation, this isn't the scope of REST nor HTTP. What data you return depends on your business needs and/or on the capabilities of the representation formats exchanged. If a certain media-type isn't able to express a certain capability, providing such data to clients might not make much sense.

In general, I'd would recommend not to use domain internal IDs as part of URIs for the above mentioned reasons. Usually that information should be part of the exchanged payload to give users/customers the option to refer to that resources on other channels like e/mail, telephone, ... Of course, that depends on the resource and its purpose at hand. As a user I'd prefer to refer to myself with my full name rather than some internal user- or customer ID or the like.

edit: sorry, missed the validation aspect ...

If you expect user/client input on the server/API side, you should always validate the data before starting to process it. Usually though, URIs are provided by the server and might only trigger business activities if the URI requested matches one of your defined rules. In general, most frameworks will respond with 400 Bad Request responses when they couldn't map the URI to a concrete action, giving the client a chance to correct its mistake and reissue the updated request. As URIs shouldn't be generated or altered by clients anyways, validating such parameters might be unnecessary overhead unless they might introduce security risks. Here it might be a better approach then to toughen-up the mapping rules of URIs to actions then and let those frameworks respond with a 400 message when clients use stuff they aren't supposed to.

score -1 · Answer 3 · answered Nov 02 '22 at 23:12

Encapsulation makes sense when we want to hide data and implementation behind an interface. Here we want to expose the structure of the data, because it is for communication, not for storage and the service certainly needs this communication in order to function. Validation of data is a very basic concept, because it makes the service reliable and because is protects against hacking attempts. The id here is a parameter and checking its structure is just parameter validation, which should return 400 if failed. So this is not restricted to the body of the request, the problem can be anywhere in the HTTP message as you can read below. Another argument against 404 that the requested resource cannot possibly exist, because we are talking about a malformed id and so a malformed URI. It is very important to validate every user input, because a malformed parameter can be used for injections e.g. for SQL injection if it is not validated.

The HyperText Transfer Protocol (HTTP) 400 Bad Request response status code indicates that the server cannot or will not process the request due to something that is perceived to be a client error (for example, malformed request syntax, invalid request message framing, or deceptive request routing).

vs

The HTTP 404 Not Found response status code indicates that the server cannot find the requested resource. Links that lead to a 404 page are often called broken or dead links and can be subject to link rot. A 404 status code only indicates that the resource is missing: not whether the absence is temporary or permanent. If a resource is permanently removed, use the 410 (Gone) status instead.

In the case of REST we describe the interface using the HTTP protocol, URI standard, MIME types, etc. instead of the actual programming language, because they are language independent standards. As of your specific case it would be nice to check the uniform interface constraints including the HATEOAS constraint, because if your service makes the URIs as it should, then it is clear that a malformed id is something malicious. As of Spotify and other APIs, 99% of them are not REST APIs, maybe REST-ish. Read the Fielding dissertation and standards instead of trying to figure it out based on SO answers and examples. So this a classic RTFM situation.

In the context of REST a very simple example of data hiding is storing a number something like:

PUT /x {"value": "111"} "content-type:application/vnd.example.binary+json"
GET /x "accept:application/vnd.example.decimal+json" -> {"value": 7}

Here we don't expose how we store the data. We just send the binary and decimal representations of it. This is called data hiding. In the case of id it does not make sense to have an external id and convert it to an internal id, it is why you use the same in your database, but it is fine to check if its structure is valid. Normally you validate it and convert it into a DTO.

Implementation hiding is more complicated in this context, it is sort of avoiding micromanagement with the service and rather implement new features if it happens frequently. It might involve consumer surveys about what features they need and checking logs and figuring out why certain consumers send way too many messages and how to merge them into a single one. For example we have a math service:

PUT /x 7
PUT /y 8
PUT /z 9
PUT /s 0
PATCH /s {"add": "x"}
PATCH /s {"add": "y"}
PATCH /s {"add": "z"}
GET /s -> 24

vs

POST /expression {"sum": [7,8,9]} -> 24

If you want to translate between structured programming, OOP and REST, then it is something like this:

Number countCartTotal(CartId cartId);

<=>

interface iCart {
    Number countTotal();
}

<=>

GET api/cart/{cartid}/total -> {total}

So an endpoint represents an exposed operation something like verbNoun(details) e.g. countCartTotal(cartId), which you can split into verb=countTotal, noun=cart, details=cartId and build the URI from it. The verb must be transformed into a HTTP method. In this case using GET makes the most sense, because we need data instead of sending data. The rest of the verb must be transformed into a noun, so countTotal -> GET totalCount. Then you can merge the two nouns: totalCount + cart -> cartTotal. Then you can build an URI template based on the resulting noun and the details: cartTotal + cartId -> cart/{cartid}/total and you are done with the endpoint design GET {root}/cart/{cartid}/total. Now you can bind it to the countCartTotal(cartId) or to the repo.resource(iCart, cartId).countTotal().

So I think if the structure of the id does not change, then you can even add it to the API documentation if you want to. Though it is not necessary to do so.

From security perspective you can return 404 if the only possible reason to send such a request is a hacking attempt, so the hacker won't know for certain why it failed and you don't expose details of the protection. In this situation it would be overthinking the problem, but in certain scenarios it makes sense e.g. where the API can leak data. For example when you send a password reset link, then a web application usually asks for an email address and most of them send an error message if it is not registered. This can be used to check if somebody is registered on the site, so better to hide this kind of errors. I guess in your case the id is not something sensitive and if you have proper access control, then even if a hacker knows the id, they cannot do much with that information.

Another possible aspect is something like what if the structure of the id changes. Well we write a different validation code, which allows only the new structure or maybe both structures and make a new version of the API with v2/api and v2/docs root and documentation URIs.

So I fully support your point of view and I think the other developer you mentioned does not even understand OOP and encapsulation, not to mention webservices and REST APIs.

REST API: Does validation on identifiers break encapsulation?

3 Answers3