
Suppose we have a database (e.g. Oracle) and a JMS provider (e.g. HornetQ) participating in an XA transaction. A message is sent to a JMS queue and some data are persisted in the database in the same distributed transaction. After the transaction is committed, a message consumer will read the persisted data and process them in a separate transaction.

Regarding the first XA transaction, the following sequence of events may be executed by the transaction manager (e.g. JBoss):

  1. prepare (HornetQ)
  2. prepare (Oracle)
  3. commit (HornetQ)
  4. commit (Oracle)
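The window the question is about can be illustrated with a minimal sketch in plain Java. The `Resource` class below is a hypothetical in-memory stand-in for an XA participant, not a real `XAResource` implementation; it only models the prepared/committed states:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an XA participant: tracks prepared vs. committed.
class Resource {
    private final String name;
    private boolean prepared, committed;
    Resource(String name) { this.name = name; }
    void prepare() { prepared = true; }
    void commit() {
        if (!prepared) throw new IllegalStateException("commit before prepare");
        committed = true;
    }
    boolean isCommitted() { return committed; }
    String name() { return name; }
}

public class CommitWindowDemo {
    public static void main(String[] args) {
        Resource jms = new Resource("HornetQ");
        Resource db  = new Resource("Oracle");

        // Phase 1: prepare both participants.
        jms.prepare();
        db.prepare();

        // Phase 2: commit one participant at a time -- the window opens here.
        jms.commit();

        // A consumer triggered by the already-committed JMS message that
        // reads at this point sees the inconsistent combination:
        List<String> observed = new ArrayList<>();
        for (Resource r : new Resource[] { jms, db }) {
            observed.add(r.name() + "=" + (r.isCommitted() ? "committed" : "stale"));
        }
        System.out.println(observed); // [HornetQ=committed, Oracle=stale]

        db.commit(); // the window closes only now
    }
}
```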

What happens if the message consumer starts reading the data after the commit has completed in HornetQ but while it is still executing in Oracle? Will the message consumer read stale data?

The question can be generalized to any kind of multiple resources participating in XA transactions, i.e. is there a possibility for a small time window (when commit phases are executed) in which a reader from another concurrent transaction can get an inconsistent state (by reading committed data from one resource and stale data from another one)?

I would say that the only way for transactional resources to prevent this is to block all readers of affected data once the prepare phase is completed until the commit is issued. This way the example message consumer mentioned above would block until data is committed in the database.
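The blocking idea described above can be sketched with a read/write lock: the resource takes a write lock at prepare time and releases it only when the commit completes, so concurrent readers block during the window. This is a hypothetical in-memory model (and it assumes prepare and commit arrive on the same thread), not how Oracle or HornetQ actually implement it:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of "block readers between prepare and commit": hypothetical
// in-memory resource, not a real XAResource implementation.
public class BlockingResource {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private int committedValue = 0;
    private int pendingValue = 0;

    public void prepare(int newValue) {
        lock.writeLock().lock();   // from here on, readers block
        pendingValue = newValue;
    }

    public void commit() {
        committedValue = pendingValue;
        lock.writeLock().unlock(); // readers resume and see committed data
        // (assumes commit() runs on the same thread as prepare())
    }

    public int read() {
        lock.readLock().lock();    // blocks while a prepare is outstanding
        try {
            return committedValue;
        } finally {
            lock.readLock().unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingResource db = new BlockingResource();
        db.prepare(42);

        Thread reader = new Thread(() -> System.out.println("read: " + db.read()));
        reader.start();            // blocks until the commit below

        Thread.sleep(100);         // simulate the other participant committing first
        db.commit();
        reader.join();             // prints "read: 42", never a stale 0
    }
}
```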

Dragan Bozanovic
  • Good question; it is the main problem with JTA IMO. It is not properly documented; even the spec is much too light to describe (as it should) such a complex mechanism. – Nicolas Filotto Jun 09 '16 at 13:56
  • Moreover, this is not the worst use case; think about cases where you have a failure on commit and the XAResource implementor doesn't cover recovery. – Nicolas Filotto Jun 09 '16 at 13:58
  • The detailed specification is too long to put in an answer, but it may be found in the Oracle white paper ["XA and Oracle controlled Distributed Transactions"](http://www.oracle.com/technetwork/products/clustering/overview/distributed-transactions-and-xa-163941.pdf), on page 12 in the chapter "Distributed Transactions and Database Locking". – ThinkJet Jun 10 '16 at 23:12
  • I set up exactly the same kind of multi-resource transaction (2 datasources, one JMS queue) and I never noticed such a behavior. Both HornetQ and the database instances are remote from the transaction manager's perspective. That doesn't mean it's not possible, but it stays highly improbable IMHO (maybe with in-VM Netty acceptors for the JMS transport and huge latency in DB access). – Gab Jul 04 '16 at 12:55

4 Answers


Unfortunately, XA transactions don't guarantee this kind of consistency. Mapped onto the CAP theorem, XA addresses availability and partition tolerance across multiple datastores; in doing so it must sacrifice consistency. When using XA you have to embrace eventual consistency.

In any event, building systems that are CP or AP is hard enough that, regardless of your datastore or transactional model, you will face this problem.

Justin

I have some experience with a slightly different environment based on WebLogic JMS and Oracle 11g. In this answer I assume they behave the same way. I hope my answer helps you.

In our case there was a remote system that had to be notified about different events happening inside the local system. The other system also read from our database, so the use case seems almost identical to your problem. The sequence of events was exactly the same as yours. On the test systems there was not a single failure, and everyone thought it would work, though some of us doubted whether it was the correct solution. Once the software hit production, some of the BPM processes ran unpredictably. So a simple answer to your question: yes, it is possible, and everyone should be aware of it.

Our solution (in my opinion) was not a well-planned one, but we recognized that the little time window between the two commits was breaking the system, so we added some delay to the queue (if I remember correctly, it was 1-2 minutes). That was enough for the other commit to finish so that consistent data could be read. In my view it is not the best solution, because it does not solve the synchronization problem (what if an Oracle transaction takes longer than 1-2 minutes?).

Here is a great blog post that is worth reading, and the last solution it describes seems best to me. We implemented it in another system and it works much better. It is important to note that you should limit the retries (re-reads) to prevent stuck threads, with some error reporting. With these restrictions I have not been able to find a better solution so far, so if anyone has a better option I'd be glad to hear it. :)
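The bounded re-read described above could look roughly like this (a plain Java sketch; the method name, attempt count and delay are illustrative, not from any framework):

```java
import java.util.Optional;
import java.util.function.Supplier;

// Sketch of a bounded re-read: the consumer retries the database lookup
// a limited number of times before giving up and reporting an error.
public class BoundedReread {
    public static <T> Optional<T> readWithRetry(Supplier<Optional<T>> query,
                                                int maxAttempts,
                                                long delayMillis) throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Optional<T> result = query.get();
            if (result.isPresent()) {
                return result;             // data is visible: the DB commit finished
            }
            if (attempt < maxAttempts) {
                Thread.sleep(delayMillis); // give the second commit time to land
            }
        }
        return Optional.empty();           // caller reports an error / parks the message
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate data that only becomes visible on the third read.
        int[] calls = {0};
        Optional<String> row = readWithRetry(
            () -> ++calls[0] >= 3 ? Optional.of("order-123") : Optional.<String>empty(),
            5, 10);
        System.out.println(row.orElse("not found")); // order-123
    }
}
```

The attempt limit is what prevents the stuck threads mentioned above: a consumer that never sees the data eventually returns empty and can route the message to error handling instead of retrying forever.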


Hash
  • "what if an oracle transaction is longer than 1-2mins" Well, the delay we're talking about is the time to commit the transaction, not the whole transaction duration: a few dozen milliseconds at most. Moreover, JMS messages are generally sent over the network, which should take longer than just the second commit phase. In my opinion there is another issue in your design. – Gab Jul 04 '16 at 12:48
  • Could this be solved easier by acquiring a lock on the data being read from the message consumer (`select for update` for example), than with timeouts? With timeouts, either we can introduce a too large fixed latency for each message or increase the risk of reading stale data again if the timeout is too low. – Dragan Bozanovic Jul 05 '16 at 12:04
  • A simple lock is not a step forward as I see it. At read time the required data will not be there yet (the insert is not committed yet). The manual synchronization I mentioned is (in my opinion) a faster and more reliable solution. I'll be back from holiday in 2 days; if needed, I could draw a flowchart. On my phone my possibilities are pretty limited. – Hash Jul 05 '16 at 14:03
  • True, if the change is an insert statement. But if it is an update, then lock should work. – Dragan Bozanovic Jul 05 '16 at 14:49
  • It really depends. Some timestamp-based locking strategies will abort the original transaction instead of placing the second one in a lock wait. I would really like to know what your final solution will be. Please update us when you have had the chance to reproduce it. :) – Hash Jul 05 '16 at 17:31

Yes. It is possible for an external system to receive and consume the messages you send before the DB actually commits, even if the transaction fails and rolls back afterwards.

For the last two years I've been doing maintenance and development of a distributed system using XA transactions with WebSphere MQ as JMS provider and Oracle 11g as backing DB.
One of its batch jobs would read offline messages from the DB, send them to JMS and mark them as sent in the DB -- all as part of the same XA transaction. If any message send or DB operation failed, the transaction would be rolled back.

Sometimes a message would be too large for JMS, causing the send() to fail and the whole transaction to roll back, leaving the DB unchanged.
However, an external consumer was still receiving and processing every single message sent before the rollback. I knew because they would send me an email for every message processed, and I was getting lots of emails about messages that weren't marked as sent in the DB (because the transaction had been rolled back).

If this external system were to somehow SELECT COUNT(*) the messages sent by my system, it would read 0 messages sent, despite having already consumed hundreds of them.

So, yes it is possible for an external system to read stale data even when using XA transactions.

walen

I would add a status field, so that after each step, if it was successful, the status is updated, and the reader checks the status before performing an operation.
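A rough sketch of this status-field idea (hypothetical names; in a real system the status column would live in the database and the check would be part of the reader's query):

```java
// Sketch: each step advances a status field, and the reader only
// processes records whose status shows the prior step completed.
public class StatusGuard {
    enum Status { CREATED, DB_COMMITTED, PROCESSED }

    static final class Record {
        Status status = Status.CREATED;
    }

    // The writer advances the status only after its step succeeds.
    static void markDbCommitted(Record r) { r.status = Status.DB_COMMITTED; }

    // The reader refuses records the writer has not finished with yet.
    static boolean tryProcess(Record r) {
        if (r.status != Status.DB_COMMITTED) {
            return false;              // stale or in-flight: skip / retry later
        }
        r.status = Status.PROCESSED;
        return true;
    }

    public static void main(String[] args) {
        Record r = new Record();
        System.out.println(tryProcess(r)); // false: DB step not done yet
        markDbCommitted(r);
        System.out.println(tryProcess(r)); // true
    }
}
```

Note that this converts the problem into the same re-read/retry scheme discussed in the other answers: a reader that sees a record whose status is not yet final must skip it and come back later.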