
There is a product-feature matrix with thousands of rows (products) and hundreds of columns (features). The values are binary and indicate whether the product has a given feature. So it could be a table of 40,000 rows and 900 columns.

Product-feature matrix

pr  f1 f2 f3 ... fn
01   0  1  1      1
02   0  0  0      0
03   1  0  1      0
04   1  0  1      0
...

First, I have to find the products that have a given set of features Q, e.g. Q = (f1=1, f5=1, f27=1). Simply put: find blue, 3-door hatchback cars.

Result 1
Given Q=(f1=1, f5=1, f27=1)
Relevant products: 03, 04, 08...

Second, and most important, I have to find how many of the products that have the feature set Q also have a feature f_i (for each i = 1..n). In other words: select the rows that satisfy Q, then count the 1s in each column (a SUM aggregation). E.g. how many blue, 3-door hatchbacks also have a diesel engine, a gasoline engine, or xenon lights.

Result 2
Given Q=(f1=1, f5=1, f27=1)
sum f2 = 943
sum f3 = 543
sum f4 = 7
sum f6 = 432
....
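In SQL terms, over the wide table above, the second task would look something like the following (the table name product_features is just for illustration). The naive plan has to scan every row matching Q and sum every column:

select sum(f2) as sum_f2,
       sum(f3) as sum_f3,
       sum(f4) as sum_f4
       -- ... one sum per remaining feature column
from product_features
where f1 = 1 and f5 = 1 and f27 = 1;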

Of course it is possible to solve this task with an RDBMS, but it is not very efficient: in the general case it requires a full scan, both for finding the products and for aggregating each column. At least I don't know how to build an effective B-tree index for this task. An Oracle bitmap index could help, but I can't use Oracle.

Currently we are using MySQL for this task, but it is not showing good results. Actually, we use an integer representation (we grouped the features and store integers, not boolean values, in the columns) to reduce the number of columns.

It's possible to treat this matrix as a sparse binary matrix, and it's not a big problem to store it completely in memory. I'm wondering whether some algorithms for sparse matrices and vector spaces (SVD, matrix-vector multiplication and so on) could be applied. But those would probably help with finding the products that satisfy the vector Q, not with the aggregation. The problem lies in aggregation time, not in space.

Perhaps the matrix could be stored as a multi-linked list, which would help both to find products and to aggregate each column.

Finally, the question: how should this task be approached? What is the most efficient algorithm to find the products with given features and then count the products having each additional feature (aggregate by column)?

Andre
  • So you have a separate binary field for each colour? – symcbean Nov 01 '10 at 13:41
  • symcbean, a color is only an example to illustrate the task; features can have a very different nature. As I mentioned, it's possible to lower the number of columns by grouping features and making their IDs unique within a group. The columns then hold int values, not binary ones, and each column stores a specific group of features. This gets us down to only 50 columns. But even aggregating over 50 columns is not fast. – Andre Nov 01 '10 at 14:03

3 Answers


If I understand your current solution, you have one table with a row for each product and a column for each feature. This is quite an inefficient way to solve the problem.

How about three tables

"products" (product_ref, product_name) index product_ref (a list of products)

"features" (feature_ref, description) index feature_ref (a list of possible features)

and

"productfeatures" (product_ref, feature_ref) index product_ref,feature_ref and feature_ref,product_ref (each row represents a feature of a product)

You can then perform a join between the tables

select *
from products t1
join productfeatures t2 on t1.product_ref = t2.product_ref
join productfeatures t3 on t1.product_ref = t3.product_ref
where t2.feature_ref = 45
  and t3.feature_ref = 67
etc.

Jaydee
  • Your schema is correct from the normalization point of view, but "productfeatures" might have about 36,000,000 records (if we forget about the sparseness). Also, it will be very hard (CPU-wise) to count the products, grouped by feature, that also satisfy a given set of features. So I intentionally denormalized the table to make the computation faster. – Andre Nov 01 '10 at 13:10
  • Forgetting about the "sparseness" is a mistake. Try it and see. All I can say is that it works for me and f00. – Jaydee Nov 01 '10 at 13:57
  • OK, let's talk about 1,000,000 records. Yes, this schema works great for FINDING products that have some features. But how should we count how many products the user would get for each additional feature he could select, given the features already selected? With the schema you presented, we have to run your SELECT as many times as there are features (minus the ones already used in the query). – Andre Nov 01 '10 at 14:08
  • Probably with a subquery variation of the above; depends on your use case: "select count(*) as numproducts from ("previous query") t4 join productfeatures t5 where t5.product_ref=t4.product_ref and t5.feature_ref=532" – Jaydee Nov 01 '10 at 14:23
  • Yes, there are several ways to do this with the schema, but they consume a lot of time and space. I would like a tool/algorithm that gives me results in 0.01-0.05 sec. – Andre Nov 01 '10 at 14:35
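For reference, the subquery variation from the comment above can be written as a single GROUP BY that returns the count for every additional feature in one pass (a sketch against the schema in this answer; feature ids 45 and 67 are the example values from the query above):

select pf.feature_ref, count(*) as numproducts
from productfeatures pf
join (
    select t2.product_ref
    from productfeatures t2
    join productfeatures t3 on t2.product_ref = t3.product_ref
    where t2.feature_ref = 45
      and t3.feature_ref = 67
) q on pf.product_ref = q.product_ref
group by pf.feature_ref;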

Please have a look at this example I did a while ago. It follows what Jaydee correctly outlined, but in more detail, and runs against 125 million poster_categories (think car_features) with a runtime of 0.02 secs. Your worst case would be 40K+ rows * 900 cols = 36 million rows, i.e. it's a baby!

Rewriting mysql select to reduce time and writing tmp to disk

select count(*) from category
count(*)
========
500,000


select count(*) from poster
count(*)
========
1,000,000

select count(*) from poster_category
count(*)
========
125,675,688

select count(*) from poster_category where cat_id = 623
count(*)
========
342,820

explain
select
 p.*,
 c.*
from
 poster_category pc
inner join category c on pc.cat_id = c.cat_id
inner join poster p on pc.poster_id = p.poster_id
where
 pc.cat_id = 623
order by
 p.name
limit 32;

id  select_type table   type    possible_keys   key     key_len ref                         rows
==  =========== =====   ====    =============   ===     ======= ===                         ====
1   SIMPLE      c       const   PRIMARY         PRIMARY 3       const                       1   
1   SIMPLE      p       index   PRIMARY         name    257     null                        32  
1   SIMPLE      pc      eq_ref  PRIMARY         PRIMARY 7       const,foo_db.p.poster_id    1   

select
 p.*,
 c.*
from
 poster_category pc
inner join category c on pc.cat_id = c.cat_id
inner join poster p on pc.poster_id = p.poster_id
where
 pc.cat_id = 623
order by
 p.name
limit 32;

0.021: Query OK
Jon Black
  • f00, thanks for your reply! This schema solves only the first part of the task. The most important thing is the aggregation, i.e. if I have already chosen cat_id=623, how many products will I get if I add cat_id=624, cat_id=625, and so on to the query? An example from real life: the user chooses blue cars, and we need to show how many blue cars with diesel engines we have, how many blue BMWs we have, and so on. If we ran the aggregation separately for each cat_id, it would take 0.021 * 342,820 ≈ 7,199 sec ≈ 2 hours. – Andre Nov 01 '10 at 13:53
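For completeness, the per-category counts asked about in the comment above can be fetched in one statement over this schema, by self-joining poster_category and grouping (a sketch; cat_id 623 is the example value from the answer):

select pc2.cat_id, count(*) as num_posters
from poster_category pc1
join poster_category pc2 on pc2.poster_id = pc1.poster_id
where pc1.cat_id = 623
group by pc2.cat_id;

Whether this meets the 0.01-0.05 sec target on a 125-million-row table is a separate question, since it still touches every category row of every matching poster.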

You could arrange your data by column, i.e. have one BitSet per feature column, listing the cars/rows which have that feature.

This allows you to take advantage of the fast AND/OR operations provided by BitSet.

Using these operations you should be able to achieve less than 2 microseconds both for the search and for the count of each feature.

For a 40,000 * 900 dataset, the following prints

average search time 1396 ns.
average count time 1234 ns.

This should be a few orders of magnitude faster than you can get with an RDBMS. Even with one million rows, each operation takes less than 50 microseconds.

import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class FeatureMatrix {
    public static void main(String... args) {
        final int rows = 40 * 1000;
        final int cols = 900;

        // one BitSet per feature column; bit j is set if product j has that feature
        List<BitSet> features = new ArrayList<BitSet>();
        features.add(new BitSet(rows));
        features.add(new BitSet(rows));
        for (int i = 2; i < cols; i++) {
            final BitSet bs = new BitSet(rows);
            for (int j = 0; j < rows; j++)
                bs.set(j, j % i == 0);
            features.add(bs);
        }

        // perform the search: find all products having every feature in Q
        int[] factors = new int[]{2, 5, 7};
        BitSet matches = new BitSet();
        long runs = 1000 * 1000;
        {
            long start = System.nanoTime();
            for (int i = 0; i < runs; i++) {
                // perform lookup.
                lookup(matches, features, factors);
            }
            long avgTime = (System.nanoTime() - start) / runs;
            System.out.println("average search time " + avgTime + " ns.");
        }
        {
            long start = System.nanoTime();
            int count9 = 0; // holds the Result-2 count for feature 9
            BitSet col9matched = new BitSet(rows); // scratch holds row bits, so size it by rows
            for (int i = 0; i < runs; i++) {
                final int index = 9;
                final BitSet feature = features.get(index);
                count9 = countMatches(col9matched, matches, feature);
            }
            long avgTime = (System.nanoTime() - start) / runs;
            System.out.println("average count time " + avgTime + " ns.");
        }
    }

    // how many products matching Q also have the given feature
    private static int countMatches(BitSet scratch, BitSet matches, BitSet feature) {
        // recycle the scratch BitSet to avoid allocating in the hot loop
        scratch.clear();
        scratch.or(matches);
        scratch.and(feature);
        return scratch.cardinality();
    }

    // AND together the BitSets of all the features in Q
    private static void lookup(BitSet matches, List<BitSet> data, int[] factors) {
        matches.clear();
        matches.or(data.get(factors[0]));
        for (int i = 1, factorsLength = factors.length; i < factorsLength; i++) {
            matches.and(data.get(factors[i]));
        }
    }
}
Peter Lawrey
  • Wow, that is fantastic! It's really what I was searching for! Thank you, Peter, you helped me a lot! I'm going to integrate this code into our PHP frontend using Apache Thrift. – Andre Nov 01 '10 at 19:48
  • I have to earn additional reputation to do this :) First day here :) Thank you once again! – Andre Nov 01 '10 at 21:55