
I am going to store a huge amount of matrix data in a MySQL database. What is the most efficient way to store and access the data?

Efficiency matters most when reading the data; the table will not be updated regularly.

The matrix is about 100,000 × 1,000 (probably larger in the future).


    id         value  value_id
    id1        value  value_id1
    id1        value  value_id2
    id2        value  value_id1
    id2        value  value_id2
    ...
    id100000   value  value_id1000

vs
              value_id1  value_id2  value_id3 ... value_id1000
    id1       value      value      value
    id2       value      value      value
    id3       value      value      value
    ...
    id100000  value      value      value

When the data is this large, which is more efficient: the long, narrow table (fetched with short MySQL queries) or the data stored as a wide matrix? The data is read regularly, so fetching must be fast.

david
  • Is it a sparse matrix? Does it make sense in the application to subdivide it? – spraff Sep 19 '11 at 13:20
  • Is the horizontal dimension of the matrix going to stay consistent? Or is it likely to shrink or grow? – dmcnelis Sep 19 '11 at 13:22
  • The horizontal dimension will probably be consistent; it might change in an "upgrade" but not regularly! – david Sep 19 '11 at 13:27
  • Subdivision is not a good option for the application, and it is not sparse; all columns and rows are filled. – david Sep 19 '11 at 13:28
  • how much inserting and updating will you do? Does your application need to update individual elements, or whole rows (horizontal, or shorter dimension) of the matrix at a time? Does it need to update whole columns (longer dimension)? – O. Jones Sep 19 '11 at 13:59
  • Perhaps pedantic, but could you define "efficient"? Time, size, optimize for read, insert, update? – Neville Kuyt Sep 19 '11 at 14:45
  • Inserts and updates are allowed to take time, since the table is not updated regularly; efficiency matters most when getting the data! I will probably only add columns to the table, but as above, update time is not a concern in my case! – david Sep 19 '11 at 15:23
  • What will be the typical use? Will it be subselects on id, or subselects on value_id ? If so: are these subsets "ordered" (ranges) Will a typical query retrieve all data ? 10 percent ? only 1 percent ? Only one cell or one row or one column ? – wildplasser Sep 19 '11 at 15:36
  • Is the size of the matrix 100,000 x 1000 or 10,000,000 x 1000 ? Your statement says the former but your example seems to indicate the latter. It is important because the two problems require different approaches. – srivani Sep 19 '11 at 18:23
  • A typical use will most likely hit between 1 and 10 percent of the data. The matrix is 100,000 × 1,000! The 10,000,000 was supposed to be the total number of rows in example 1. – david Sep 20 '11 at 06:55
  • Did you find a good solution for this problem? I am facing a similar challenge. I tried MySQL the way @Tae-Sung Shin suggested, but it appears to be very slow. My matrix is about 10k x 20k – zyxue Mar 14 '18 at 20:54

2 Answers


Since you said you want efficiency in fetching, I would use the following table format:

 Column Row Value 
      1   1   1.2
      2   1   2.3
      ...

Using this format with an index on the matrix's column and row, you can fetch any part of the data as fast as you want.
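A minimal sketch of this layout, using SQLite in place of MySQL so it runs self-contained (the table and column names are illustrative; in MySQL you would use the same composite key):

```python
import sqlite3

# Normalized (row, col, value) layout; SQLite stands in for MySQL here.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE matrix (
        row   INTEGER NOT NULL,
        col   INTEGER NOT NULL,
        value REAL    NOT NULL,
        PRIMARY KEY (row, col)   -- composite index serves the lookups below
    )
""")

# Populate a tiny 3x3 example matrix.
cells = [(r, c, r * 10.0 + c) for r in range(3) for c in range(3)]
conn.executemany("INSERT INTO matrix VALUES (?, ?, ?)", cells)

# Fetch a single cell ...
cell = conn.execute(
    "SELECT value FROM matrix WHERE row = ? AND col = ?", (1, 2)
).fetchone()[0]

# ... or a whole row; both are range scans on the (row, col) index.
row1 = [v for (v,) in conn.execute(
    "SELECT value FROM matrix WHERE row = ? ORDER BY col", (1,)
)]
print(cell, row1)  # 12.0 [10.0, 11.0, 12.0]
```

Because the key is `(row, col)`, fetching a full row is a contiguous index scan; if you mostly fetch by column instead, you would add a second index with `col` leading.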

Tae-Sung Shin

There are a couple of relevant questions here:

The answers for dense matrices seem to boil down to either a normalized table with columns for column, row, and value, as suggested by Taesung above, or something like storing each row of your original matrix as a BLOB.
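A sketch of the BLOB-per-row alternative, again using SQLite as a stand-in for MySQL (the schema and helper names are made up for illustration; each row is packed as fixed-width doubles so one query returns the whole row):

```python
import sqlite3
import struct

N_COLS = 4  # the real matrix would have ~1000 columns

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE matrix_rows (row INTEGER PRIMARY KEY, data BLOB)")

def pack_row(values):
    # Fixed-width little-endian doubles, one per column.
    return struct.pack(f"<{len(values)}d", *values)

def unpack_row(blob):
    return list(struct.unpack(f"<{N_COLS}d", blob))

conn.execute("INSERT INTO matrix_rows VALUES (?, ?)",
             (0, pack_row([1.5, 2.5, 3.5, 4.5])))

# One primary-key lookup retrieves an entire matrix row.
blob = conn.execute(
    "SELECT data FROM matrix_rows WHERE row = ?", (0,)
).fetchone()[0]
print(unpack_row(blob))  # [1.5, 2.5, 3.5, 4.5]
```

The trade-off versus the normalized table: full-row reads are a single small query, but fetching one column across many rows means unpacking every BLOB.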

HDF5 looks to be made for this sort of thing. It would be great if someone with experience could comment further.

cbare