
How can data allocation be optimized in a distributed database?

Are there any software products for solving this problem?

For example:

There is some number of connected servers hosting the distributed database. Each server is simultaneously a client of this database.

The database has many tables.

We have query statistics from each client to each table.

Each server has a known data-storage price, and each (server, client) pair has a known transfer price.

Objective: allocate all tables (or parts of tables) across the servers at minimum total cost.
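To make the objective concrete, here is a sketch of the cost model described above; all table names, sizes, prices, and query counts are made up for illustration:

```python
# Sketch of the cost model: storage cost per server plus transfer cost
# for every client query. All names and numbers are illustrative.

def total_cost(allocation, table_size, storage_price, transfer_price, queries):
    """allocation maps each table to the server that stores it."""
    cost = 0.0
    for table, server in allocation.items():
        cost += storage_price[server] * table_size[table]                   # storage
        for client, stats in queries.items():
            cost += stats.get(table, 0) * transfer_price[(server, client)]  # transfer
    return cost

# Toy instance: two servers, each also a client, and three tables.
table_size = {"a": 10, "b": 5, "c": 2}
storage_price = {"s1": 1.0, "s2": 2.0}
transfer_price = {("s1", "s1"): 0.0, ("s1", "s2"): 1.0,
                  ("s2", "s1"): 1.0, ("s2", "s2"): 0.0}
queries = {"s1": {"a": 100, "c": 50}, "s2": {"b": 80}}  # client -> table -> count

print(total_cost({"a": "s1", "b": "s2", "c": "s1"},
                 table_size, storage_price, transfer_price, queries))  # prints 22.0
```

With a cost function like this in hand, any of the search techniques mentioned below can be plugged in on top of it.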

To solve this problem we can apply a variety of heuristic algorithms: genetic algorithms, evolution strategies, ant colony optimization, etc.
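As the simplest member of that family, here is a self-contained local-search (hill-climbing) sketch that moves one table at a time while the total cost improves; every name, size, and price below is illustrative:

```python
# Illustrative hill-climbing search for a table allocation.
# All servers, tables, prices, and query counts are made up.
import itertools

SERVERS = ["s1", "s2"]
TABLES = ["a", "b", "c"]
SIZE = {"a": 10, "b": 5, "c": 2}
STORAGE = {"s1": 1.0, "s2": 2.0}                       # price per size unit
TRANSFER = {("s1", "s1"): 0.0, ("s1", "s2"): 1.0,      # (storing server, client)
            ("s2", "s1"): 1.0, ("s2", "s2"): 0.0}
QUERIES = {"s1": {"a": 100, "c": 50}, "s2": {"b": 80}}  # client -> table -> count

def cost(alloc):
    c = 0.0
    for t, s in alloc.items():
        c += STORAGE[s] * SIZE[t]
        for client, stats in QUERIES.items():
            c += stats.get(t, 0) * TRANSFER[(s, client)]
    return c

def hill_climb(alloc):
    """Move single tables between servers while any move lowers the cost."""
    improved = True
    while improved:
        improved = False
        for t, s in itertools.product(TABLES, SERVERS):
            if alloc[t] != s and cost({**alloc, t: s}) < cost(alloc):
                alloc = {**alloc, t: s}
                improved = True
    return alloc

start = {t: "s2" for t in TABLES}       # naive start: everything on one server
best = hill_climb(start)
print(best, cost(best))                 # {'a': 's1', 'b': 's2', 'c': 's1'} 22.0
```

A genetic algorithm or ant colony optimizer would explore the same search space, just with a population of candidate allocations instead of a single one.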

But I could not find any ready-made software tools that implement these algorithms for this purpose.

Are there any tools to solve this problem for distributed databases (Oracle or others)?

Does anybody care about it?

And does anybody have examples of distributed-database systems with query statistics that have been optimized in this way?

Thanks!

  • What do you mean by "allocate all tables on servers"? Are you asking about how to distribute the data on storage? Or are you asking about load balancing for queries? – Jeffrey Kemp Nov 10 '10 at 07:26
  • I mean data distribution. For example, a resulting distribution could be: tables a and c on server1, table b on server2 – Artem Frolov Nov 10 '10 at 11:32
  • If all the clients go through the same Oracle instance, then what difference does it make where the data is physically stored? – kurosch Dec 16 '10 at 22:58
  • kurosch, the cost of storage and/or transfer from server A may be greater than from server B. So the aim is to find the optimal allocation for each part of the data – Artem Frolov Dec 17 '10 at 12:52

3 Answers


An example of a distributed database that addresses this problem is Clustrix, which is the only database that distributes indexes independently. Clustrix is a database built from the ground up to be a distributed MySQL replacement.

More on how Clustrix does data distribution and the distributed evaluation model

– clieu

I've looked for something similar, but the sad truth is that there are no off-the-shelf tools for doing this kind of analysis for databases. You can find a lot of information, though, in various research projects, university papers, and so on.

As an alternative, this could be modelled using off-the-shelf mathematical tools to optimize the data localization/correlation to specific clients.

– Adam Hawkes

I think it is a lot easier to store the data in a centralized database and configure a cache for the various locations. Because the different locations are unlikely to be in the same grid, the cache should be synchronous: with an asynchronous cache, the order of updates in the database might not be the order in which the updates were applied. The cache will eliminate a lot of query network traffic and improve performance for the remote locations compared with accessing the database directly.

The Oracle In-Memory Database Cache option could be worth investigating. It works for 10.2.0.4 databases and above, using the 11.2.1.8 version of what was formerly called TimesTen. A great option.

The algorithms you asked about are effectively caching algorithms: keep frequently used data close to the consumer, at the best possible price. If you can spend more on memory, more data fits in the cache, and LRU eviction takes care of removing the less frequently used data.
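The LRU eviction behaviour mentioned above can be sketched in a few lines; the capacity and keys here are purely illustrative:

```python
# Minimal LRU cache sketch: the least recently used entry is evicted
# when capacity is exceeded. Capacity and keys are illustrative.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None                   # miss: would fetch from the central DB
        self.data.move_to_end(key)        # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used entry

cache = LRUCache(2)
cache.put("row1", "...")
cache.put("row2", "...")
cache.get("row1")          # row1 is now most recently used
cache.put("row3", "...")   # capacity exceeded: row2 is evicted
print(list(cache.data))    # prints ['row1', 'row3']
```

A real cache layer such as TimesTen is of course far more involved (write-through, consistency, SQL awareness), but the eviction idea is the same.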