There is a product-feature matrix. It has thousands rows (products) and hundreds features. It has binary values that show if the product has this feature or not. So it could be a table of 40 000 rows and 900 columns.
Product-feature matrix
pr f1 f2 f3 fn ...
01 0 1 1 1
02 0 0 0 0
03 1 0 1 0
04 1 0 1 0
.....
First, I have to find products that have a given set of features Q. E.g. Q=(f1=1, f5=1, f27=1). Simple said, find blue cars, hatchback, 3-doors.
Result 1
Given Q=(f1=1, f5=1, f27=1)
Relevant products: 03, 04, 08...
Second, and most important, I have to find how many products that have a set of features Q, also have a feature f_i (where i - 1..n). In other words, we are selecting rows that satisfy Q, and count how many 1 in each column (make SUM aggregation). E.g. how many blue cars, hatchback, 3-doors also has: diesel engine, gasoline engine, xenon lights.
Result 2
Given Q=(f1=1, f5=1, f27=1)
sum f2 = 943
sum f3 = 543
sum f4 = 7
sum f6 = 432
....
Of course it is possible to solve this task using an RDBMS but it's not so effective - in general case it will require a fullscan both for finding products and aggregation in each columns. At least I don't know how to build an effective b-tree index for this task. Oracle bitmap index could help, but I can't use Oracle.
Currently, we are using MySQL for this task, but it's not showing good results. Actually we are using integer representation (we grouped features and store integers in columns, not bool values) to reduce the amount of columns.
It's possible to treat this matrix as a sparse binary matrix. And it's not a big problem to store it completely in memory. And I'm wondering if it's possible to apply some algorithms to work with sparse matrices, vector space (SVD, matrix-vector multiplications and so on). But it probably helps in finding products that satisfy vector Q, not in aggregation. The problem lies more in time of aggregation, not in space.
Probably, it's possible to store matrix as a multi-linked list that would help to find products and make aggregation for each column.
Finally, the question is how to treat this task. What is the most effective algorithm to find products with given features and then count the products with additional features (aggregate by each column).