
The matrix has 50 million rows and 100k columns. The rows are mostly sparse; each row has fewer than 1k non-zero values. All the columns are also very sparse in this case. The values are all 32-bit ints. The most common operations on these data are:

  • get several rows based on index.
  • obtain row indexes based on a query: either a simple combination like `col_a > val_a AND col_b > val_b OR col_c > val_c ...`, or a more complex condition like "select the top 10 rows by `col_x + col_y + ...`", possibly with weights applied, e.g. `col_x*x + col_y*y + ...`. We do not need to support conditions involving more than 10 different columns in a single query.
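To make these query shapes concrete, here is a rough sketch (my own illustration; `coord`, the column ids and the weights are all made up, and PostgreSQL is only assumed) of how they might look if the matrix were stored in a coordinate/EAV layout with one database row per non-zero cell:

```sql
-- Hypothetical coordinate (row, col, value) layout: one row per non-zero cell.
CREATE TABLE coord (
    row_id integer NOT NULL,   -- 0 .. 50M-1
    col_id integer NOT NULL,   -- 0 .. 100k-1
    value  integer NOT NULL,   -- the 32-bit cell value
    PRIMARY KEY (row_id, col_id)
);

-- Operation 1: fetch several rows by index (returns all their non-zero cells).
SELECT row_id, col_id, value
FROM coord
WHERE row_id IN (42, 1000, 987654);

-- Operation 2a: boolean condition over a few columns, e.g. col_1 > 10 AND col_7 > 5
-- (cells that are not stored count as 0).
SELECT row_id
FROM coord
WHERE col_id IN (1, 7)
GROUP BY row_id
HAVING bool_or(col_id = 1 AND value > 10)
   AND bool_or(col_id = 7 AND value > 5);

-- Operation 2b: top 10 rows by a weighted sum, e.g. 2*col_3 + 5*col_9.
-- (Rows with no stored value in cols 3 or 9 score 0 and are omitted here.)
SELECT row_id,
       SUM(CASE col_id WHEN 3 THEN 2 * value WHEN 9 THEN 5 * value ELSE 0 END) AS score
FROM coord
WHERE col_id IN (3, 9)
GROUP BY row_id
ORDER BY score DESC
LIMIT 10;
```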

Our frontend guys want me to put the sparse matrix in a database. I barely work with databases, but I am aware that PostgreSQL (though we do not have to use PostgreSQL) does not support more than 1.6k columns, and SQL does not support more than 1024 columns. So I wonder what the best practice is here. Is there any special database that can hold this many columns per table? If there isn't one, how could I handle this in a normal SQL/PostgreSQL database?

  • Should I split each row across multiple tables (1k columns per table, 100 tables)? How efficient would that be if we need a lot of queries like top N of col_x + col_y ... across different tables?
  • Maybe I could store each row as an array in a single column? But then I cannot use sparse columns and would instead have to rely on server-side compression. Again, how do I efficiently query top N in this case? (A sketch of this layout follows the list.)
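For the array idea, a minimal sketch (illustrative names, PostgreSQL `int[]` assumed); the top-N query has to scan and score every row:

```sql
-- Hypothetical dense-array layout: one 100k-element int[] per matrix row,
-- zeros stored explicitly and (hopefully) squeezed by TOAST compression.
CREATE TABLE matrix_rows (
    row_id integer PRIMARY KEY,
    vals   integer[] NOT NULL      -- vals[1] .. vals[100000]
);

-- Top 10 rows by a weighted combination of two columns (array indexes are 1-based).
SELECT row_id, 2 * vals[17] + 5 * vals[9321] AS score
FROM matrix_rows
ORDER BY score DESC
LIMIT 10;
```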

My concern is that I doubt a database index can help much with our queries, except for the simplest ones like `col_a > val_a AND ...`. For the top N queries, in theory we could cache the top 100 rows for each combination, but given the number of possible combinations that is impossible.
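As far as I understand (illustrative sketch, assuming the array layout above), an expression index can serve a top-N query only for the exact, pre-declared combination it was built on, which is why it does not help with arbitrary combinations:

```sql
-- A btree expression index helps only the exact expression it was built for.
CREATE INDEX idx_fixed_combo ON matrix_rows ((vals[17] + vals[9321]));

-- This ORDER BY ... LIMIT can walk idx_fixed_combo backwards instead of scanning:
SELECT row_id
FROM matrix_rows
ORDER BY vals[17] + vals[9321] DESC
LIMIT 10;

-- Any other column set or weighting, e.g. 2*vals[17] + 5*vals[9321],
-- cannot use this index and falls back to a full scan.
```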

The matrix will be read-only most of the time in our application. Does it even make sense to use a database to hold it?

Update

After searching for "wide data" and "wide columns", I see that Cassandra and HBase can handle millions of columns. But they are NoSQL and probably hard to set up. Spark in theory has no column limit, but it seems efficient only for long and narrow data. So maybe there is a database that suits my particular needs.

Wang
  • This seems mostly like an opinion based question. https://stackoverflow.com/tour mentions that one should not ask opinion based questions. Do you have a specific problem you can state as what you'd like to solve? You mention that PostgreSQL only supports 1.6k columns, and your problem states you have 100k columns... are you looking for a database that can manage 100k columns? – Kevon Nov 08 '22 at 21:54
  • Either suggest a DB that can handle 100k columns efficiently (I doubt there is one), or suggest a strategy to handle this in a normal SQL/Postgres DB. This is not opinion based; the question is specific down to the form of query I want to use. @Kevon Please do not just close whatever question you fail to understand. – Wang Nov 08 '22 at 22:12
  • If you think splitting it into multiple tables is a common and good strategy, then please tell me how I can do the query efficiently with SQL commands. Isn't that clear? I am just trying not to limit the ways to solve this problem: querying a very large sparse matrix with a database. – Wang Nov 08 '22 at 22:18
  • What does "*get several rows based on index.*" mean? Which "index"? The index of a specific value in the row? –  Nov 09 '22 at 06:44
  • Btw: I don't think "SQL" imposes any limit of the number of columns. –  Nov 09 '22 at 06:45
  • Depending on "how sparse" the rows are, maybe use a `jsonb` structure, where the key is the "index", e.g. `{"1": 42, "46": 4, "93765": 10}`. Finding rows would require JSON path expressions that in theory can be supported by an index. But I have no idea how efficient that will be. In terms of storage this would only be an improvement over an array if typically only very few elements have a value. –  Nov 09 '22 at 06:47
  • thanks @a_horse_with_no_name, as I described in the first paragraph `each row only have < 1k non-zero values` so >99% sparseness. – Wang Nov 09 '22 at 15:15
  • `SQL`: I mean traditional RDBs that support SQL queries, such as MariaDB, MySQL ... – Wang Nov 09 '22 at 15:16
  • @a_horse_with_no_name `index`: I mean the row number, row id, ranges, etc.; a special column that has a unique value per row. – Wang Nov 09 '22 at 15:19
  • There is no way you can model this in PostgreSQL to have all these queries be fast. Sorry. – Laurenz Albe Nov 09 '22 at 15:21
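To make the `jsonb` suggestion from the comments concrete, here is a minimal sketch (illustrative names; missing keys are treated as 0):

```sql
-- Hypothetical jsonb layout: keys are column indexes, only non-zero cells stored.
CREATE TABLE matrix_jsonb (
    row_id integer PRIMARY KEY,
    cells  jsonb NOT NULL          -- e.g. {"1": 42, "46": 4, "93765": 10}
);

-- Condition on one column: col_46 > 3 (rows without key "46" are excluded, i.e. treated as 0).
SELECT row_id
FROM matrix_jsonb
WHERE (cells ->> '46')::int > 3;

-- Top 10 by col_17 + col_9321, counting missing keys as 0.
SELECT row_id,
       COALESCE((cells ->> '17')::int, 0) + COALESCE((cells ->> '9321')::int, 0) AS score
FROM matrix_jsonb
ORDER BY score DESC
LIMIT 10;
```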

1 Answer


My research concluded that no database supports 100K columns.

One way to solve this would be to have 100 tables with 1K columns each, all of which are linked together with a common key.
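Roughly like this (illustrative names only); any query that touches columns living in different tables then has to join them on that key:

```sql
-- Hypothetical split: columns 0..999 in part_000, 1000..1999 in part_001, and so on.
CREATE TABLE part_000 (row_id integer PRIMARY KEY, c0 integer, c1 integer /* ... c999 */);
CREATE TABLE part_001 (row_id integer PRIMARY KEY, c1000 integer /* ... c1999 */);

-- Top 10 by a combination that spans two of the 100 tables.
SELECT p0.row_id, p0.c1 + p1.c1000 AS score
FROM part_000 p0
JOIN part_001 p1 USING (row_id)
ORDER BY score DESC
LIMIT 10;
```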

That being said... if you do your own research outside Stack Overflow, you will likely find the same opinion I found, and hold myself: this is a really dumb idea.

Kevon