Architectural issue with Tomcat cluster environment

Question

I am working on a project that has an authentication mechanism. It follows these steps:

1. The user opens a browser, enters his/her email in a text box and clicks the login button.
2. The request goes to a server. We generate a random string (for example, 123456), send a notification to the user's Android/iPhone and make the current thread wait with the help of the wait() method.
3. The user enters a password on his/her phone and clicks the submit button.
4. Once the user clicks the submit button, the phone calls a web service on the server, passing the previously generated string (for example, 123456) and the password.
5. If the password is correct for the previously entered email, we call notify() on the previously waiting thread and send success as the response, and the user is logged in to our system.
6. If the password is incorrect for the previously entered email, we call notify() on the previously waiting thread and send failed as the response, and an invalid credentials message is displayed to the user.
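
The flow in the steps above can be sketched roughly as follows. This is only an illustration of the described design, not the actual application code: the class PendingLogins and its methods awaitReply and reply are hypothetical names, and the timeout parameter is an assumption.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical single-JVM registry of pending logins, keyed by the generated string. */
class PendingLogins {
    /** One slot per login attempt; its fields are guarded by the slot's own monitor. */
    private static final class Slot {
        boolean answered;
        boolean success;
    }

    private final Map<String, Slot> slots = new ConcurrentHashMap<>();

    /** Browser request thread: blocks until the phone replies or the timeout elapses. */
    boolean awaitReply(String token, long timeoutMillis) {
        Slot slot = slots.computeIfAbsent(token, t -> new Slot());
        long deadline = System.currentTimeMillis() + timeoutMillis;
        try {
            synchronized (slot) {
                // Guarded wait: re-check the flag so an early reply or a spurious wakeup is harmless.
                while (!slot.answered) {
                    long remaining = deadline - System.currentTimeMillis();
                    if (remaining <= 0) {
                        return false; // timed out: treated as a failed login in this sketch
                    }
                    slot.wait(remaining);
                }
                return slot.success;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            slots.remove(token);
        }
    }

    /** Phone request thread: can only wake a waiter that lives in this same JVM. */
    void reply(String token, boolean passwordCorrect) {
        // Note: a reply that arrives after the waiter timed out leaves a stale slot behind.
        Slot slot = slots.computeIfAbsent(token, t -> new Slot());
        synchronized (slot) {
            slot.answered = true;
            slot.success = passwordCorrect;
            slot.notifyAll();
        }
    }
}
```

Two separate PendingLogins instances behave like two JVMs: a reply recorded in one never wakes a thread waiting in the other.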

Everything was working fine, but we recently moved to a clustered environment. We found that some threads are never notified, even after the user has replied, and they wait indefinitely.

For the server, we are using Tomcat 5.5, and we followed the Apache Tomcat 5.5 Servlet/JSP Container documentation to set up the Tomcat cluster.

Answer :: Possible problem and solution

The likely problem is that a clustered environment has multiple JVMs. As a workaround, we now also send the URL of the specific clustered Tomcat node to the user's Android application along with the generated string.

When the user clicks the reply button, the app sends the generated string to that URL, so both requests go to the same JVM and it works fine.

But I am wondering if there is a single solution for the above issue.

There is a problem with this workaround. What happens if that Tomcat node crashes? The load balancer will send the request to the second Tomcat node and the same problem will arise again.

This seems to me like a net banking solution in which the user receives an additional password (OTP) and logs in using both his regular password and the OTP. Why is it necessary to log in via the phone? The click in the email can redirect him to a login page where he can type in his regular password and the OTP, which will arrive on his phone in the meantime. This way there is no need for wait/notify. – András Kerekes, Dec 27 '12 at 11:54
@Andras Kerekes: The premise is that login can be done by password, face authentication, voice authentication or fingerprint authentication, all of which can be done on the phone and not in the web app. This way our authentication system will be more mature, and since no password is sent over the net, we are safer. In future we plan to offer the authentication system to third parties as a plugin. – Rais Alam, Dec 27 '12 at 12:07
In this case the login page where he entered his email address could periodically query the server to check whether the login via the phone succeeded or failed. I don't think it is a good idea to keep the app server worker threads in a wait state. With many users, the application can become unresponsive due to a lack of worker threads. – András Kerekes, Dec 27 '12 at 12:24

score 9 · Accepted Answer · edited May 23 '17 at 12:01

The underlying reason for your problems is that Java EE was designed to work in a different way - attempting to block/wait on a service thread is one of the important no-no's. I'll give the reason for this first, and how to solve the issue after that.

Java EE (both the web and EJB tiers) is designed to scale to very large sizes (hundreds of computers in a cluster). However, in order to do that, the designers had to make the following assumptions, which impose specific limitations on how to code:

Transactions are:
1. Short-lived (e.g. don't block or wait for periods greater than a second or so)
2. Independent of each other (e.g. no communication between threads)
3. For EJBs, managed by the container
All user state is maintained in specific data storage containers, including:
1. A data store accessed through, e.g., JDBC. You can use a traditional SQL database or a NoSQL backend
2. Stateful session beans, if you use EJBs. Think of these as a Java Bean that persists its fields to a database. Stateful session beans are managed by the container
3. The web session. This is a key-value store (kinda like a NoSQL database but without the scale or search capabilities) that persists data for a specific user over their session. It's managed by the Java EE container and has the following properties:
  1. It will automatically relocate if the node crashes in a cluster
  2. Users can have more than one current web session (i.e. on two different browsers)
  3. Web sessions end when the user ends their session by logging out, or when the session is inactive for longer than the configurable timeout
  4. All values that are stored must be serializable for them to be persisted or transferred between nodes in a cluster

If we follow those rules, the Java EE container can successfully manage a cluster, including shutting down nodes, starting new ones and migrating user sessions, without any specific developer code. Developers write the graphical interface and the business logic - all the 'plumbing' is managed by configurable container features.

Also, at run time, the Java EE container can be monitored and managed by some pretty sophisticated software that can trace application performance and behavioural issues on a live system.

< snark >Well, that was the theory. Practice suggests there are pretty important limitations that were missed, which led to AOP and code injection techniques, but that's another story < /snark >

[There are many discussions around the 'net on this. One which focuses on EJBs is here: Why is spawning threads in Java EE container discouraged? Exactly the same is true for web containers such as Tomcat]

Sorry for the essay - but this is important to your problem. Because of the limitations on threads, you should not block on the web request waiting for another, later request.

Another problem with the current design is what should happen if the user becomes disconnected from the network, runs out of power, or simply decides to give up. Presumably you will time out, but after how long? If the timeout is too short, some customers will not manage to respond in time, which will cause satisfaction problems. If the timeout is too long, you could end up blocking all worker threads in Tomcat and the server will freeze. This opens your organisation up to a denial-of-service attack.

EDIT: Improved suggestions after a more detailed description of the algorithm was published.

Notwithstanding the discussion above on the bad practice of blocking a web worker thread and the possible denial of service, it's clear that the user is presented with a small time window in which to react to the notification on the Android phone, and this can be kept reasonably small to enhance security. This time window can also be kept below Tomcat's timeout for responses. So the thread-blocking approach could be used.

There are two ways this problem can be resolved:

1. Change the focus of the solution to the client end: poll the server using JavaScript in the browser.
2. Communicate between nodes in the cluster, allowing the node that receives the authorization response from the Android app to unblock the node that is blocking the servlet's response.

For approach 1, the browser polls the server via JavaScript with an AJAX call to a web service on Tomcat; the AJAX call returns true if the Android app has authenticated the user. Advantages: it is client-side, needs minimal implementation on the server and blocks no threads on the server. Disadvantage: during the waiting period you have to make frequent calls (maybe one a second; the user will not notice this latency), which amounts to a lot of calls and some additional load on the server.
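
As a rough sketch of what the server side of approach 1 might look like: the class LoginStatusService, its Status values and its method names are all hypothetical, and the ConcurrentHashMap merely stands in for whatever store is shared between the Tomcat nodes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical status lookup behind the AJAX endpoint; all names are illustrative only. */
class LoginStatusService {
    enum Status { PENDING, APPROVED, REJECTED, UNKNOWN }

    // Stand-in for a store visible to every node (a database table, for example).
    private final Map<String, Status> sharedStore = new ConcurrentHashMap<>();

    /** Browser request: record the pending login and return immediately. */
    void startLogin(String token) {
        sharedStore.put(token, Status.PENDING);
    }

    /** Phone request: record the outcome. Any node can handle this. */
    void recordDecision(String token, boolean passwordCorrect) {
        sharedStore.computeIfPresent(token,
                (t, old) -> passwordCorrect ? Status.APPROVED : Status.REJECTED);
    }

    /** AJAX poll: a cheap read that never blocks a worker thread. */
    Status poll(String token) {
        return sharedStore.getOrDefault(token, Status.UNKNOWN);
    }
}
```

The browser would call the poll endpoint roughly once a second and stop as soon as the answer is no longer PENDING.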

For approach 2, there is again a choice:

1. Block the thread with Object.wait(), optionally storing the node ID, IP or other identifier in a shared data store. If so, the node receiving the Android app authorization needs to:
  1. Either find the node that is currently blocking or broadcast to all nodes in the cluster
  2. For each node found in the previous step, send a message that identifies the user session to unblock. The message could be sent via:
    1. An internal-only servlet on each node, called by the servlet performing the Android app authorization. The internal servlet calls Object.notify() to wake the correct thread
    2. A JMS pub-sub message queue that broadcasts to all members of the cluster. Each node is a subscriber that, on receipt of a notification, calls Object.notify() to wake the correct thread
2. Poll a data store until the thread is authorized to continue. In this case, all the Android app needs to do is save the state in a SQL DB
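
The data store polling option could look something like the sketch below. DecisionStore and DecisionPoller are hypothetical names; a real DecisionStore would presumably run a JDBC query, and the timeout and polling interval are assumptions to be tuned.

```java
import java.util.Optional;

/** Hypothetical view of the shared data store; a real one would be backed by JDBC. */
interface DecisionStore {
    /** Empty until the phone's request has saved a decision for this token. */
    Optional<Boolean> findDecision(String token);
}

class DecisionPoller {
    /**
     * Blocks the calling servlet thread, re-reading the store every intervalMillis,
     * until a decision appears or timeoutMillis elapses (a timeout counts as failure).
     */
    static boolean awaitDecision(DecisionStore store, String token,
                                 long timeoutMillis, long intervalMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            Optional<Boolean> decision = store.findDecision(token);
            if (decision.isPresent()) {
                return decision.get();
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

Because the decision lives in the shared store, it does not matter which node receives the phone's request. The worker thread is still occupied for the whole wait, so the earlier warning about exhausting Tomcat's threads still applies.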

I really appreciate your efforts. After reading the answer, one thing is clear: the design is not correct and my server will freeze. Please suggest a solution to the above question. I am going to modify my question and write out the steps we are following in our application, so that you can get more information about the exact use case. – Rais Alam, Dec 27 '12 at 11:28
Great - got the feedback. It's midnight here now so I'm not in the best state to answer. I'll do so hopefully in the next 12 hours. – Andrew Alcock, Dec 27 '12 at 15:51
Added the concept of an in-memory cross-JVM cache such as Terracotta - thanks Luigi R. Viggiano – Andrew Alcock, Dec 28 '12 at 13:04
Really helpful. I have suggested both approaches to my senior manager; let's see what decision comes from the senior manager or architecture. We are very thankful to you and the Stack Overflow community. – Rais Alam, Dec 31 '12 at 03:42
And yes, a +100 to you. Thanks again @Andrew Alcock, and Happy New Year in advance. – Rais Alam, Dec 31 '12 at 03:43

score 1 · Answer 2 · answered Dec 23 '12 at 12:03

Using wait/notify can be tricky. Remember that any thread can be suspended at any time, so it's possible for notify to be called before wait, in which case wait will then block forever.

I wouldn't expect this in your case, as you have user interaction involved. But for the type of synchronisation you are doing, try using a Semaphore. Create a Semaphore with 0 (zero) quantity. The waiting thread calls acquire() and it will block until another thread calls release().

Using Semaphore in this way is much more robust than wait/notify for the task you described.
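
A minimal sketch of that idea, using java.util.concurrent.Semaphore. The ReplySignal class and its method names are made up for illustration, and it uses tryAcquire with a timeout instead of a bare acquire() so the waiting thread cannot block forever.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Hypothetical one-shot handoff between the waiting thread and the replying thread. */
class ReplySignal {
    private final Semaphore permits = new Semaphore(0); // zero permits: acquiring blocks
    private volatile boolean success;

    /** Replying thread: publish the result, then release one permit. */
    void reply(boolean passwordCorrect) {
        success = passwordCorrect;
        permits.release();
    }

    /** Waiting thread: returns false on timeout, interruption or a failed login. */
    boolean await(long timeout, TimeUnit unit) {
        try {
            return permits.tryAcquire(timeout, unit) && success;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

The permit is remembered, so a reply that arrives before the waiter starts waiting is not lost. Like wait/notify, though, this only works when both threads run in the same JVM.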

score 1 · Answer 3 · edited Jan 19 '13 at 15:26

After analysing your question, I came to the conclusion that the exact problem is of multiple JVMs in a clustered environment.

edited Jan 19 '13 at 15:26 by Peter Mortensen · answered Dec 23 '12 at 12:08

score 1 · Answer 4 · edited Jan 19 '13 at 15:28

The exact problem is caused by the cluster environment: the two requests are not going to the same JVM. A plain notify only works when the waiting thread is in the same JVM.

You should try to execute both requests (the first request, and the second request made when the user replies from the Android application) on the same JVM.

edited Jan 19 '13 at 15:28 by Peter Mortensen · answered Dec 23 '12 at 13:17

score 1 · Answer 5 · pd40 · edited Dec 27 '12 at 11:05

Your clustered deployment means that any node in the cluster could receive any response.

Using wait/notify with threads in a web app risks accumulating a lot of threads that may never be notified, which could leak memory or leave a lot of threads blocked. This could eventually affect the reliability of your server.

A more robust solution would be to send the request to the Android app, store the current state of the user's request for later processing and complete the HTTP request. To store the state you could consider:

A database that all Tomcat nodes connect to
A Java cache solution that works across Tomcat nodes, like Hazelcast

This state would be visible to all nodes in your Tomcat cluster.

When the reply from the Android app arrives on a different node, restore the state of what your thread was doing and continue processing on that node.

If the UI of the application is waiting on a response from the server, you might consider using an AJAX request to poll the server for the response state. The node processing the Android app response does not need to be the same one handling UI requests.

edited Dec 27 '12 at 11:05 · answered Dec 26 '12 at 12:00 by pd40

Thanks @pd40 for the reply. We designed the application using the AJAX and database approach, but it was rejected by the client, who said: "lots of unnecessary hits to the database as well as the server, and the response will not be real time". Then we came to the wait and notify solution; this approach is real time and causes no unnecessary hits to the server or the database. But the problems are, first, that the load on the server increases as threads wait on the server for 40 seconds and, second, the multiple JVM issue. – Rais Alam Dec 26 '12 at 14:12
I don't clearly understand your explanation. With threads waiting, the client still needs to hit the server, doesn't it? So how else does your customer imagine it working without touching the servers? – András Kerekes Dec 27 '12 at 08:34
The threads are not waiting in this solution. The state is persisted until the Android app responds. Any UI waiting on the Android app would poll for a response. – pd40 Dec 27 '12 at 10:53
@pd40 For the Android app we are using the Google Cloud Messaging (GCM) mechanism, and similarly for iPhone. So we have a push mechanism for sending notifications to the user's phone, not a pull mechanism. – Rais Alam Dec 27 '12 at 11:41
I have modified my question and added more steps and details so that the use case and requirement can be easily understood by the community. – Rais Alam Dec 27 '12 at 11:42

score 1 · Answer 6 · answered Dec 27 '12 at 05:38

Consider using an in-memory grid so that the instances in the cluster can share state. We used Hazelcast to share data between instances so in case a response reaches a different instance it still can handle it.

E.g. you could use a distributed countdown latch with a value of 1 to make the thread wait after sending the message; when the response arrives from the client at a separate instance, that instance can decrease the latch to 0, letting the first thread run.
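
The latch handoff can be illustrated with the JDK's own java.util.concurrent.CountDownLatch. This is a single-JVM stand-in only, not Hazelcast code: in the design described above the latch would be a distributed one obtained from the data grid, and the LatchHandoff class and its method names are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical wrapper showing the latch-of-1 handoff within a single JVM. */
class LatchHandoff {
    // Local stand-in: a clustered setup would use a distributed latch here instead,
    // so that any instance in the cluster can count it down.
    private final CountDownLatch latch = new CountDownLatch(1);

    /** Instance that sent the message: wait until the latch reaches 0 or time runs out. */
    boolean awaitResponse(long timeout, TimeUnit unit) {
        try {
            return latch.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Instance that received the client's response: 1 -> 0 releases the waiter. */
    void responseArrived() {
        latch.countDown();
    }
}
```

The latch only signals that a response arrived; the outcome of the authentication would still have to be shared through the grid as well.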

score 1 · Answer 7 · answered Dec 27 '12 at 21:16

Using Object.wait() in a web service environment is a colossal mistake. Instead, maintain a database of user/token pairs and expire them at intervals.

If you want a cluster, then use a database that is clusterable. I would recommend something like memcached since it's in-memory (and fast) and low on overhead (key/value pairs are dead simple, so you don't need RDBMS, etc.). memcached handles expiration of tokens for you already, so it seems like a perfect fit.

I think the username -> token -> password strategy is unnecessary, especially because you have two different components sharing the same 2-factor authentication responsibility. I think you can further reduce your complexity, reduce confusion for your users, and save yourself some money in SMS-send fees.

The interaction with your web service is simple:

1. User logs into your website using username + password
2. If primary authentication (username/password) is successful, generate a token and insert userid=token into memcached
3. Send the token to the user's phone
4. Present an "enter token" page to the user
5. User receives the token via phone and enters it into the form
6. Fetch the token value from memcached based upon the user's id. If it matches, expire the token in memcached and consider the second factor successful

Tokens will auto-expire after whatever amount of time you want to set in memcached
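
The token bookkeeping in this flow can be sketched with an in-memory map standing in for memcached. TokenStore, issue and redeem are hypothetical names, and the injectable clock is an assumption made only so that expiry is easy to exercise; memcached itself would handle expiry through the TTL set on each key.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** In-memory stand-in for the memcached token table; all names are illustrative only. */
class TokenStore {
    private static final class Entry {
        final String token;
        final long expiresAtMillis;

        Entry(String token, long expiresAtMillis) {
            this.token = token;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();
    private final LongSupplier clock; // current time in milliseconds
    private final long ttlMillis;

    TokenStore(LongSupplier clock, long ttlMillis) {
        this.clock = clock;
        this.ttlMillis = ttlMillis;
    }

    /** After primary authentication succeeds: remember the token issued for this user. */
    void issue(String userId, String token) {
        entries.put(userId, new Entry(token, clock.getAsLong() + ttlMillis));
    }

    /** When the user submits the token: accept it once, and only before it expires. */
    boolean redeem(String userId, String submitted) {
        Entry entry = entries.get(userId);
        if (entry == null) {
            return false;
        }
        if (clock.getAsLong() >= entry.expiresAtMillis) {
            entries.remove(userId, entry); // lazily drop the expired token
            return false;
        }
        // remove(key, value) succeeds for only one caller, so a token cannot be redeemed twice.
        return entry.token.equals(submitted) && entries.remove(userId, entry);
    }
}
```

No request thread waits anywhere in this design, which is why it is indifferent to which JVM handles each request.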

There are no threading problems with the above solution and it will scale across as many JVMs as you need to support your own software.

Thanks for the effort. The user can only enter his/her email on the website; there is no password on the web page. When we send the notification to the user's Android/iPhone, we can match the password in many ways, such as face authentication, voice authentication, fingerprint authentication or a simple text password. But the password, and the matching, exist only on the user's phone. This is the beauty, or feature, of our application: "password-less authentication on web app". – Rais Alam, Dec 28 '12 at 04:06

score 1 · Answer 8 · edited May 23 '17 at 12:01

I'm afraid threads cannot migrate across nodes in a classic Java EE cluster.

You have to rethink your architecture to implement the wait/notify differently (connection-less).

Or, you may give terracotta.org a try. It looks like it allows clustering an entire JVM process over multiple machines. Maybe it's your only solution.

Read a quick introduction in Introduction to OpenTerracotta.

edited May 23 '17 at 12:01 by Community · answered Dec 28 '12 at 01:12 by Luigi R. Viggiano

score 0 · Answer 9 · edited Jan 19 '13 at 15:27

I guess the problem is that your first thread sends a notification to the user's Android application from JVM 1, and when the user replies, control goes to JVM 2. That's the main problem.

Somehow, both threads must reach the same JVM for the wait and notify logic to work.

edited Jan 19 '13 at 15:27 by Peter Mortensen · answered Dec 23 '12 at 12:18

score 0 · Answer 10 · edited Jan 19 '13 at 15:30

Solution:

Create a single point of contact for all waiting threads. In a clustered environment, all threads will then wait on a third JVM (the single point of contact), so every request, from any clustered Tomcat, contacts the same JVM for the wait and notify logic, and no thread waits for an unlimited time. When a reply arrives, the waiting thread is notified, because the wait and the notify now happen on the same object.

edited Jan 19 '13 at 15:30 by Peter Mortensen · answered Dec 23 '12 at 13:29

It is not good practice to spawn threads inside a JEE container such as Tomcat - that is what the JEE container is for. Those extra threads are not controlled by the container, can cause the container to fail to shut down, will not be restarted by the container and cannot be moved to other members of the cluster. Use other JEE mechanisms instead: session persistence or a data store. – Andrew Alcock Dec 27 '12 at 08:39
