Scalable System Architecture/Design for Reading/Parsing Files

Question

Background: I am designing a software application that reads millions of files (or many more) and either converts or simply parses them. Part of the requirement is to build a scalable, distributed system so that reading and parsing can be scaled out as needed.

Basically, a minimally detailed list of file names is stored in one DB, and clients need to access that list to know which files to parse/convert next. The files themselves are on another server/location. While most of the pieces are designed, one critical piece that needs a revisit is the design for feeding the file names to the different clients.

I have two options now:

Design a single service that sits next to the DB, channels all requests for file names, and feeds the clients. In this case, clients talk to the service (using a predefined protocol/format) and get the list.
Design the clients to talk directly to the DB and implement the synchronization/channeling within the clients.

My only concern with the first option is whether it is a scalable architecture/design. Has anyone dealt with such a circumstance in a scalable architecture, where one resource becomes critical to scaling? (In my case, it could be one service feeding/servicing all clients.)

Do the files reside in a database table or are they stored in the file system? — home, Mar 24 '12 at 06:17
@home: Only file names reside in DB. Files are located on different physical server elsewhere on same network. — santosh, Mar 24 '12 at 06:19

score 2 · Answer 1 · answered May 10 '12 at 05:09

I suggest using a distributed data grid like GigaSpaces (http://www.gigaspaces.com/datagrid) on top of your DB. This way you can partition your data across several machines and lower the contention on your DB: clients will read the files to be processed from different instances of the data grid. Scalability is then possible by increasing the number of data grid partitions as your load increases and deciding how to partition your data among the data grid instances.
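
As a rough illustration of the partitioning idea (not GigaSpaces code), file names could be assigned to partitions with a stable hash. The function names and the choice of CRC32 are assumptions made for this sketch; a real data grid applies its own routing.

```python
import zlib


def partition_for(file_name, partition_count):
    """Map a file name to a partition index with a stable hash.

    zlib.crc32 is used instead of the built-in hash() because hash() on
    strings is randomized per process, so different clients would disagree.
    """
    return zlib.crc32(file_name.encode("utf-8")) % partition_count


def split_into_partitions(file_names, partition_count):
    """Group file names by the partition they hash to (hypothetical helper)."""
    partitions = [[] for _ in range(partition_count)]
    for name in file_names:
        partitions[partition_for(name, partition_count)].append(name)
    return partitions
```

With plain modulo hashing like this, changing the partition count reassigns most file names, so it only suits setups where the number of partitions changes rarely.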

There are several possibilities to consider for making sure only one client reads a specific file to be processed, one of them could be by using the data grid's take operation (read & remove) which makes sure only one client 'takes' a file to be processed.
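
To make the "take" semantics concrete, here is a minimal in-memory sketch, assuming a single process with several client threads. `FileNameSpace`, `worker`, and `run_clients` are hypothetical names, and a lock stands in for whatever the data grid does internally; this is not the GigaSpaces API.

```python
import threading
from collections import deque


class FileNameSpace:
    """Hypothetical in-memory stand-in for one data grid partition."""

    def __init__(self, names):
        self._names = deque(names)
        self._lock = threading.Lock()

    def take(self):
        """Atomically read and remove one file name; return None when empty."""
        with self._lock:
            return self._names.popleft() if self._names else None


def worker(space, taken):
    """Keep taking file names until the space is empty."""
    while True:
        name = space.take()
        if name is None:
            break
        taken.append(name)  # stand-in for parsing/converting the file


def run_clients(names, client_count=4):
    """Run several client threads against one space; return names per client."""
    space = FileNameSpace(names)
    per_client = [[] for _ in range(client_count)]
    threads = [
        threading.Thread(target=worker, args=(space, per_client[i]))
        for i in range(client_count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return per_client
```

Because the read and the remove happen under one lock, no two clients can ever receive the same file name, which is the property the take operation is meant to give you.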

GigaSpaces also offers a great monitoring tool so you can monitor your load (liveness, statistics, etc.).

score 0 · Accepted Answer · answered Mar 24 '12 at 09:09

I would like to suggest that you look at message queues such as RabbitMQ (http://www.rabbitmq.com), Microsoft Message Queue (http://bit.ly/GMo4iI) and IBM Message Queue (http://bit.ly/GMo6qY), which already have scaling architectures in place.

You can set up clients to request messages from the queue and configure each message body to contain the details of the files to be processed. The client can then delete the message from the queue once the file has been processed.

You need to set up mechanisms to make sure the same files are not read at the same time, etc., but this can be done at the queue level, and you can configure each client to look at specific queues or message priorities.
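
The receive, process, then delete cycle could look roughly like the in-memory sketch below. `WorkQueue`, `drain`, and `max_attempts` are hypothetical names for illustration only; a real broker such as RabbitMQ or MSMQ has its own client API and delivery guarantees.

```python
import itertools
from collections import deque


class WorkQueue:
    """Hypothetical in-memory queue; a message stays held until acknowledged."""

    def __init__(self):
        self._ready = deque()   # (msg_id, body, attempts so far)
        self._in_flight = {}    # msg_id -> (body, attempts so far)
        self._ids = itertools.count(1)

    def publish(self, body):
        self._ready.append((next(self._ids), body, 0))

    def receive(self):
        """Hand out the next message and mark it in flight; None if empty."""
        if not self._ready:
            return None
        msg_id, body, attempts = self._ready.popleft()
        self._in_flight[msg_id] = (body, attempts + 1)
        return msg_id, body, attempts + 1

    def ack(self, msg_id):
        """Delete the message once the client is finished with it."""
        del self._in_flight[msg_id]

    def nack(self, msg_id):
        """Return an unprocessed message to the back of the queue."""
        body, attempts = self._in_flight.pop(msg_id)
        self._ready.append((msg_id, body, attempts))

    def pending(self):
        return len(self._ready) + len(self._in_flight)


def drain(queue, process, max_attempts=3):
    """Process every message; give up on one after max_attempts failures."""
    done, failed = [], []
    while (msg := queue.receive()) is not None:
        msg_id, body, attempts = msg
        try:
            process(body)
        except Exception:
            if attempts >= max_attempts:
                queue.ack(msg_id)   # drop it and record the failure
                failed.append(body)
            else:
                queue.nack(msg_id)  # put it back for another try
        else:
            queue.ack(msg_id)       # delete only after successful processing
            done.append(body)
    return done, failed
```

Deleting the message only after processing succeeds means that a file whose processing raised an error is put back on the queue and retried rather than lost.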

Messaging frameworks are a good option for this scenario. I will also have to take a look at RabbitMQ and MSMQ, which offer a publish/subscribe method of handling clients. — santosh, Mar 24 '12 at 21:57
