Gathering distributed data into central database

Question

I was assigned to update an existing system that gathers data from points of sale and inserts it into a central database. The current system is based on FTP/SFTP transmission, with the information sent once a day, usually at night. Unfortunately, because of unstable connection links (low quality 2G/3G modems), some of the files arrive broken. With just a few shops connected that way everything worked smoothly, but as the number of shops increased, errors became more frequent. What is worse, inserting the data into the central database takes up to 12-14 hours (including waiting for the data to be downloaded from all of the shops), and that cannot happen during the working day because it would block the creation of sales reports and other database activities, so we are really tight on processing time here.

The idea my manager suggested is to send the data continuously during the day. Data packages would be significantly smaller, so their transmission and insertion would be much faster, the central server would contain current (almost real-time) data, and the night could be used for long-running database activities like creating backups, rebuilding indexes etc.

After going through many websites, I found that:

using ASMX web service is now obsolete and WCF should be used instead
WCF with MSMQ or System Messaging could be used to safely transmit data, where I don't have to care that much about acknowledging delivery of data, consistency, nodes going offline etc.
according to http://blogs.msdn.com/b/motleyqueue/archive/2007/09/22/system-messaging-versus-wcf-queuing.aspx WCF queuing is better
there are also other technologies for implementing message queue, like RabbitMQ, ZeroMQ etc.

And that is where I become confused. With so many options, can you share any pros and cons of these technologies? We were using .NET with Windows Forms and SQL Server, but if necessary, we could change to something more suitable. I am also a bit worried about server efficiency. After some calculations, the server would be receiving about 15 packages of data per second (peak). Is that a lot? I know there are many websites without serious server infrastructure that handle hundreds of visitors online and still run smoothly, but a website mainly uploads data to the client, and here we would download it from the client.

I also found a somewhat similar SO question: Middleware to build data-gathering and monitoring for a distributed system, where DDS was mentioned. What do you think about introducing some middleware servers that would cope with the low quality links to points of sale, so the main server would not be clogged with 1KB/s transmissions?

I'd be grateful for all your help. Thank you in advance!

Is the connection from the shops to the central server across the public internet? — tom redfern, May 07 '15 at 13:01

score 1 · Answer 1 · answered May 08 '15 at 13:30

RabbitMQ can easily cope with thousands of 1 KB messages per second.

As your use case is not about processing real-time data, I'd say you should combine a few messages and send them as a batch. That would be good enough to spread the load over the day.
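A minimal sketch of that batching idea, in Python purely for illustration (the thread itself uses .NET). The `Batcher` class, its `max_items`/`max_age_s` thresholds and the JSON encoding are all assumptions made for this example, not part of any library mentioned here.

```python
import json
import time


class Batcher:
    """Collects small messages and releases them as one batch.

    A batch is released when it holds max_items messages, or when
    max_age_s seconds have passed since its first message. The age is
    only checked when a new message arrives; a real client would also
    flush on a timer and at shutdown.
    """

    def __init__(self, max_items=50, max_age_s=300.0, clock=time.monotonic):
        self.max_items = max_items
        self.max_age_s = max_age_s
        self.clock = clock
        self._items = []
        self._first_at = None

    def add(self, message):
        """Add one message; return a finished batch (list) or None."""
        if not self._items:
            self._first_at = self.clock()
        self._items.append(message)
        if len(self._items) >= self.max_items:
            return self.flush()
        if self.clock() - self._first_at >= self.max_age_s:
            return self.flush()
        return None

    def flush(self):
        """Return whatever is pending (possibly empty) and reset."""
        batch, self._items, self._first_at = self._items, [], None
        return batch


def encode_batch(batch):
    """Serialise a batch to bytes so any transport can carry it."""
    return json.dumps(batch, separators=(",", ":")).encode("utf-8")
```

The thresholds trade latency against the number of transmissions: larger batches mean fewer connections over the weak link, but the central database lags further behind the shops.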

As the motivation here is not to process the data in real time, any transport layer would do the job, even FTP/SFTP. RabbitMQ will work fine here, but this is not its typical use case.

As you mentioned that one of your concerns is a slow/unreliable network, I'd suggest compressing the files before sending them and verifying their integrity immediately on the receiving end. Rsync or a similar tool will probably do a great job of that.
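As a rough illustration of "compress, then verify on receipt", here is a Python sketch using only the standard library. The function names `pack` and `unpack` are made up for this example, and it assumes the SHA-256 digest travels alongside the file (for instance in a small sidecar file or a request header).

```python
import gzip
import hashlib


def pack(payload):
    """Sender side: compress the data and compute a checksum of the result."""
    compressed = gzip.compress(payload)
    digest = hashlib.sha256(compressed).hexdigest()
    return compressed, digest


def unpack(compressed, expected_digest):
    """Receiver side: reject the file unless its checksum matches."""
    actual = hashlib.sha256(compressed).hexdigest()
    if actual != expected_digest:
        raise ValueError("checksum mismatch: file is corrupt or truncated")
    return gzip.decompress(compressed)
```

A receiver that gets a `ValueError` can discard the file and ask the shop to resend it, rather than loading broken data into the database.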

tom redfern · Accepted Answer · 2015-06-10T20:14:57.823

From what I understand, you have basically two problems:

1. Potential for loss/corruption of call data
2. Database write performance

The potential for loss/corruption of call data is being caused by a lack of reliability in the transmission of data from client to service.

And it's not clear what is causing the database contention/performance issues, beyond a vague reference to high volumes, so this answer will be more geared towards solving the first problem.

You have correctly identified the need for reliable asynchronous communication transport as a way to address the reliability issues in your current setup.

Looking at MSMQ to deliver this is a valid first step. MSMQ provides reliable communication via a store and forward messaging semantic which comes out of the box and requires very little in the way of configuration.

Unfortunately, while suitable for your needs, MSMQ relies on 2 things:

1. A reliable network protocol, and
2. A client service running on both the sending and the receiving machine.

From your description above, I don't believe 1 exists (the internet is not a reliable network), and you might well struggle with 2: MSMQ only ships with Windows Server or business/enterprise versions of Windows on the desktop. (* see below)

As a possible solution to the network reliability problem, you could use WCF or a RESTful endpoint (using Nancy or Web API) to expose one or more service operations over HTTP, which would accept the incoming calls from the client machines. These technologies are quite different, so you'll need to make sure you're making the correct choice early on.

WCF supports WS-ReliableMessaging (a WS-* standard layered on top of SOAP) out of the box, which allows for reliable web service calls over HTTP. However, it's very config-heavy and not generally a nice framework to work with.

REST is much simpler than WCF in .NET, and is very lightweight and easy to use. However, for reliable delivery you would have to expose some kind of GET operation (in addition to a POST that allows the client to send data), which the client calls within a reasonable time frame to verify that the data was committed. The client would have to implement some kind of retry semantic if the result of the GET "acknowledgement" was negative.
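The POST-then-GET handshake with client-side retry could look roughly like the following Python sketch. Everything here is hypothetical: `FakeServer` stands in for the HTTP service, `FlakyLink` imitates the unreliable modem connection, and `send_with_ack` is the client logic. No real HTTP is involved, so the example stays self-contained.

```python
import uuid


class FakeServer:
    """Stands in for the HTTP service; stores committed batches by id."""

    def __init__(self):
        self.committed = {}

    def post(self, batch_id, body):
        # Storing by id makes a repeated POST harmless (idempotent).
        self.committed[batch_id] = body

    def get_status(self, batch_id):
        return batch_id in self.committed


class FlakyLink:
    """Drops the first `failures` requests to imitate a bad modem link."""

    def __init__(self, server, failures):
        self.server = server
        self.failures = failures

    def _maybe_fail(self):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("link dropped")

    def post(self, batch_id, body):
        self._maybe_fail()
        self.server.post(batch_id, body)

    def get_status(self, batch_id):
        self._maybe_fail()
        return self.server.get_status(batch_id)


def send_with_ack(link, body, max_attempts=5):
    """POST the batch, then GET its status; retry until acknowledged."""
    batch_id = str(uuid.uuid4())  # the same id is reused on every retry
    for attempt in range(1, max_attempts + 1):
        try:
            link.post(batch_id, body)
            if link.get_status(batch_id):
                return batch_id, attempt
        except ConnectionError:
            pass  # a real client would wait (back off) before retrying
    raise RuntimeError("batch %s not acknowledged" % batch_id)
```

The client-generated id is the important design choice: because the server keys on it, a retry after a lost response cannot insert the same batch twice.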

Despite requiring two operations rather than one for the WCF route, I would favour the REST approach. I've done plenty of both and find REST services way nicer to work with.

(*) That's not to say that MSMQ wouldn't work in your ultimate solution, just that it would not be used to address the transmission reliability issue. However it could still be used to address another of your problems, that of database write contention. If you were to queue incoming requests once they came into the server, then these could be processed by an "offline" process, which could then perform the required database operations in a reliable manner. This could be done by using MSMQ transactional queues.
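To make the "queue on arrival, write offline" idea concrete, here is a small Python sketch. It is only an analogy: `queue.Queue` is an in-memory stand-in for a durable MSMQ transactional queue, SQLite stands in for SQL Server, and the `sales` table and function names are invented for the example.

```python
import queue
import sqlite3


def receive(inbox, message):
    """Called by the web endpoint: enqueue and return immediately."""
    inbox.put(message)


def drain(inbox, db):
    """Offline worker: write everything queued so far in one transaction.

    Unlike a real transactional queue, messages are taken off the
    in-memory queue before the commit, so a crash here would lose them.
    """
    rows = []
    while True:
        try:
            rows.append(inbox.get_nowait())
        except queue.Empty:
            break
    if rows:
        with db:  # commits on success, rolls back on error
            # INSERT OR IGNORE is SQLite syntax; it skips duplicate ids.
            db.executemany(
                "INSERT OR IGNORE INTO sales (sale_id, shop_id, amount) "
                "VALUES (?, ?, ?)",
                rows,
            )
    return len(rows)


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE sales (sale_id TEXT PRIMARY KEY, shop_id INTEGER, amount REAL)"
)
inbox = queue.Queue()
receive(inbox, ("s-1", 7, 9.99))
receive(inbox, ("s-2", 7, 4.50))
receive(inbox, ("s-1", 7, 9.99))  # duplicate caused by a client retry
drain(inbox, db)
```

Writing many queued messages in one transaction, from a single worker, is what relieves the write contention: the endpoint never touches the database directly.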

In response to comments:

99% messages are passed from shop to main server, but if some change is needed (price correction, discounts etc.), that data has to be sent to shop.

This kind of changes things. Had I understood from the beginning that you had a bidirectional requirement, and seeing as you have managed to establish MSMQ communication, I would have nudged you towards NServiceBus, which is a really, really cool wrapper around MSMQ. The reason is that you appear to have both a one-way and a publish-subscribe requirement, which NServiceBus supports really nicely.

Thank you for your response, it explained a lot to me. I was not aware of MSMQ limitation (Windows Server 2008 is not supported) until I started to code the prototype application yesterday. So it looks like I have to implement my own protocol, based on HTTP requests with confirmation mechanism and queue to ensure everything was transmitted. — Mark, May 12 '15 at 08:48
@Mark - Wait! Windows Server 2008 definitely does support MSMQ. You need to install Message Queuing windows component — tom redfern, May 12 '15 at 09:38
Oh, you are right. I didn't see that option in "Add features" section because of different language translation... Still - MSMQ is just one of the options. I'll proceed to test RabbitMQ and give my impression here. Thanks! — Mark, May 12 '15 at 10:59
Okay, I managed to implement bi-directional MSMQ communication using a WCF service as the consumer and a WinForms app as the producer. To be honest, it isn't exactly the way I thought it would work, because I still have to maintain a "listening" service at each shop... Maybe HTTP requests aren't that bad? They can respond to a particular request and don't need that configuration step (installing MSMQ, opening ports etc.) to get things going. — Mark, May 14 '15 at 09:00
Mark, just wondering, but why do you need bi-directional communication? I was under the impression that sending data from the shops to the server was one way only. If you're using queuing with transactional queues you don't need to check from the client if the data has been committed, you can just trust that it will be. — tom redfern, May 14 '15 at 09:04
I forgot to mention that, my bad :( I'm sorry for misleading you. In fact, 99% messages are passed from shop to main server, but if some change is needed (price correction, discounts etc.), that data has to be sent to shop. Maybe I need some kind of socket communication? Websockets? — Mark, May 14 '15 at 09:16
Thank you - we'll test NServiceBus as well and choose the best technology. I marked your answer as accepted, it really helped me a lot. — Mark, May 22 '15 at 12:32
