5

I need to run a lot of checks to determine whether the intersection of two sets (one is identical for all checks, the other one changes) is empty or not.

It's OK if the check occasionally says the intersection is not empty when it actually is (there can be a second, more precise filtering step), so false positives are OK. It is not allowed that I filter out something that definitely has a non-empty intersection, so false negatives are not OK.

So, to illustrate the scenario:

{A,B,C,D} <-> {D,E,F} => true (D is in the intersection), never allowed to be false

{A,B,C} <-> {D,E,F} => false (empty intersection), may also return true in a small number of checks

For a single element I would use a Bloom filter, but I cannot find something similar for a set of elements. Checking element by element against a Bloom filter would be a possible option, but I'm searching for something better.
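For illustration, roughly what that element-by-element baseline would look like (a minimal Java sketch; the single 64-bit word and the hashCode-based bit choice are my assumptions, not a fixed design):

import java.util.Set;

class ElementwiseCheck {
    // Baseline: probe a Bloom-style bitset (built over the fixed set) once
    // per element of the changing set. A set bit may be a false positive,
    // but an unset bit proves the element is not in the fixed set,
    // so the check can never produce a false negative.
    static boolean mayIntersect(long fixedSetBits, Set<String> changingSet) {
        for (String element : changingSet) {
            int bit = Math.floorMod(element.hashCode(), 64); // element -> one of 64 bits
            if ((fixedSetBits & (1L << bit)) != 0) {
                return true; // possibly in the fixed set -> possibly non-empty intersection
            }
        }
        return false; // every element is provably absent -> intersection is empty
    }
}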

Peter V.
    How big are/can-be the sets? – Scott Hunter Nov 20 '17 at 13:18
  • Number of changing sets is really big (100 million), but I have time/memory to create/store them. The challenge here is that the check needs to be very fast. Also, I cannot use a lot of code for the check; only conditions are allowed (at the end it's an SQL query) – Peter V. Nov 20 '17 at 13:39
  • @ScottHunter: Number of changing sets >100M, Number of elements > 5000, Size of a set Min. 1, Avg. 10, Max. 50 – Peter V. Nov 20 '17 at 13:40
  • Do you have an *upper* bound on the elements that each set draws on (you said > 5000)? And how much latitude do you have in the representation in the DB? – Scott Hunter Nov 20 '17 at 13:49
  • Are you looking to minimize runtime or false positive rate? What is the objective function? Two obvious solutions are 100% false positives & O(1) runtime, and 0% false positives & O(n^2) runtime. The runtime of the latter is easily improved to O(n log n) by sorting. To have a bounded false positive rate less than 100% I strongly suspect you must have runtime at least O(n). – Patrick87 Nov 20 '17 at 13:54
  • Of what type are the elements in the sets? How many different elements exist in the domain? – MrSmith42 Nov 20 '17 at 13:59
  • @Scott Hunter: The use case is a permission check. The changing sets contain the user-groups assigned to an element; the fixed set contains the user-groups you belong to. As soon as there is a non-empty intersection you know you're allowed to see the element. There is no real boundary on how many groups you can be part of, but you have to specify them manually (which limits it, not hardcoded but by the effort you have to spend). A group can also be a company/customer of the system, so there is definitely no upper bound. – Peter V. Nov 20 '17 at 15:29
  • @Patrick87: Those are connected. At the end (after I get all changing sets that are found to have a non-empty intersection) I have to do a 100% check, which is a lot more expensive and can't do everything, so this is only a prefiltering. That means if I reduce false positives I have to do fewer full/expensive checks and save a lot of memory for holding the result. So neither of the two extremes is possible: for 100% / O(1) I don't have enough memory, and for 0% / O(n^2) I can't express the check logic in a simple condition that can be used in e.g. SQL. – Peter V. Nov 20 '17 at 15:41
  • @PeterV. are your elements sorted within each list? – Luai Ghunim Nov 20 '17 at 15:41
  • @LuaiGhunim: Yes (could be done easily) – Peter V. Nov 20 '17 at 15:46
  • It'd be really helpful if you would edit your question and add the information that you've supplied in comments so that we can see everything in one place rather than trying to pick it out of comments. – Jim Mischel Nov 20 '17 at 19:08
  • The nature of your problem is still unclear to me. Are you saying that you have to check each fixed set against all 100 million changing sets? It's also not clear where you're drawing the line between program code and SQL queries. Your problem statement is so vague that it's impossible to give any reasonable suggestion. – Jim Mischel Nov 20 '17 at 19:15

3 Answers

2

Thanks a lot for your answers; they helped me come up with a good solution and solve the problem.

The idea is really primitive, but sufficient.

I create two bitsets, one for the changing set and one for the fixed set. Each element of a set is hashed to one bit (e.g. for a 64-bit long, one of bits 1-64), and the bits of all elements are OR-ed together for the set (essentially a Bloom bitset with k=1).

To check whether there is a non-empty intersection, I simply combine the two bitsets with a bitwise AND and check whether the result is not 0.

The false-positive rate would be worse than that of a proper Bloom filter (I think; I didn't do the math), but it's good enough for my case.

Example:

[A,B,C] => 0000100010100000

[B,D,F] => 0100000010000100

---------------------- &

0000000010000000 != 0 => true
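A minimal sketch of the whole scheme (Java; the single 64-bit word and the hashCode-based bit choice are my assumptions; in practice the bitset would be sized to keep the false-positive rate acceptable):

import java.util.Set;

class BloomBitsetCheck {
    // Build a k=1 Bloom bitset: hash each element to one of 64 bits, OR them together.
    static long toBitset(Set<String> set) {
        long bits = 0L;
        for (String element : set) {
            bits |= 1L << Math.floorMod(element.hashCode(), 64);
        }
        return bits;
    }

    // The whole check is one AND: 0 guarantees an empty intersection (no false
    // negatives); anything else means "possibly non-empty" (false positives allowed).
    static boolean mayIntersect(long fixedBits, long changingBits) {
        return (fixedBits & changingBits) != 0;
    }
}

Since the bitsets of the changing sets can be precomputed and stored, the per-check cost is a single AND plus a comparison, which also translates directly into a simple SQL condition.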

Peter V.
  • If there are n and m elements mapped to a k-bit set, with k much larger than n or m, the false positive rate should be about nm/k. – Veedrac Dec 01 '18 at 12:03
0

One optimization would be to keep a list (an array, for fast lookup) with the min/max values of each set. Then first check in that list whether they overlap. If not -> return false; no further checks needed.

S1: a b c
S2:       d e f

S1 and S2 -> false (no overlap)
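A sketch of that precheck (Java; assumes the per-set min/max values are precomputed and stored alongside the sets):

class RangeCheck {
    // Two sets can only intersect if their value ranges overlap; if the
    // ranges are disjoint, the intersection is guaranteed to be empty.
    static boolean rangesOverlap(int min1, int max1, int min2, int max2) {
        return min1 <= max2 && min2 <= max1;
    }
}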

If the sets are sorted and they do overlap, you only have to check the overlapping region.

S1: a b c d e
S2:       d e f g h

Only check the 'd e' region
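A sketch of that restricted check on two sorted arrays (Java; a standard two-pointer walk, skipped forward to the start of the overlap via binary search):

import java.util.Arrays;

class OverlapRegionCheck {
    // Walk both sorted arrays with two pointers, but only inside the
    // overlapping value range; any equal pair proves a non-empty intersection.
    static boolean intersects(int[] s1, int[] s2) {
        int lo = Math.max(s1[0], s2[0]);                          // start of overlap
        int hi = Math.min(s1[s1.length - 1], s2[s2.length - 1]);  // end of overlap
        if (lo > hi) return false;                                // ranges are disjoint

        // binarySearch returns the index if found, otherwise -(insertionPoint) - 1
        int i = Arrays.binarySearch(s1, lo); if (i < 0) i = -i - 1;
        int j = Arrays.binarySearch(s2, lo); if (j < 0) j = -j - 1;

        while (i < s1.length && j < s2.length && s1[i] <= hi && s2[j] <= hi) {
            if (s1[i] == s2[j]) return true;
            if (s1[i] < s2[j]) i++; else j++;
        }
        return false;
    }
}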

If you need to check the intersection of more than 2 sets, first try to find two non-overlapping sets. If found -> return false. If not, only check the overlapping region for all those sets (which should get smaller with more sets).

S1: a b c d e
S2:       d e f g h
S3:             g h i

S1 and S2 and S3 -> false (S1 and S3 do not overlap)

If most of the sets span a wide range, you could use another option:

Let's say the maximum number of elements is 6400 (for this example), and every element is, or can be converted to, an integer 1-6400.

For each set a small bitmap (a 64-bit unsigned integer) could be created, with one bit representing a range of 100 values.

For example:

S1: 1,23,80,312,340
S2: 160,184,450
S3: 230,250,340

S1 bitmap: 100100..
S2 bitmap: 010010..
S3 bitmap: 001100..

S1 and S2 -> false
S1 and S3 -> true, only check range 301-400
S1 and S2 and S3 -> false

You could of course use a smaller number than 100 (preferably a power of 2, so you can quickly set the corresponding bit) and use multiple uint64s.
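A sketch of the bucketed bitmap (Java; I use a bucket size of 128 = 2^7 rather than 100, so the bucket index is a single shift, and assume the elements are positive ints):

class RangeBitmap {
    // Map each element to a bucket of 128 consecutive values and set one bit
    // per occupied bucket. Buckets beyond 64 wrap around, which can only add
    // false positives, never false negatives.
    static long toBitmap(int[] set) {
        long bitmap = 0L;
        for (int element : set) {
            bitmap |= 1L << ((element >>> 7) & 63); // bucket = element / 128
        }
        return bitmap;
    }

    // 0 -> no shared bucket, definitely disjoint; otherwise the set bits
    // tell you exactly which ranges still need a deeper check.
    static long sharedBuckets(long bitmap1, long bitmap2) {
        return bitmap1 & bitmap2;
    }
}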

This could even be done in multiple levels (depending on the amount of memory/storage space you're willing to use). For example, first a really quick check on one 64-bit integer (takes one CPU cycle and can easily be done in SQL). Only for those that match, check a second level containing maybe 4, 8 or 16 uint64s, with each bit representing a smaller range of values (this could also be really fast using SSE/AVX registers). If they still match, do a deeper check, but only for the ranges corresponding to the set bits in the result.
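A sketch of such a two-level check (Java; the layout with one second-level word per first-level bit is just one possible choice):

class TwoLevelCheck {
    // Level 1: one 64-bit summary word per set, each bit covering a coarse range.
    // Level 2: one 64-bit word per coarse range, each bit covering a finer range.
    static boolean mayIntersect(long top1, long[] fine1, long top2, long[] fine2) {
        long shared = top1 & top2;
        if (shared == 0) return false;                 // cheap first-level rejection
        while (shared != 0) {
            int bucket = Long.numberOfTrailingZeros(shared);
            if ((fine1[bucket] & fine2[bucket]) != 0) {
                return true;                           // a fine range still matches
            }
            shared &= shared - 1;                      // clear lowest set bit
        }
        return false;
    }
}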

Danny_ds
0

You mentioned you're doing it in SQL. So we have something like this:

  • PatternSet (ElemId int16 primary key): first table with the set to check against
  • ProbablyChangedSets (ElemId int16, SetId int, primary key(ElemId, SetId)): second table containing the sets to check against PatternSet

I'm curious: is this query's performance just not enough?

-- sets with intersections
select distinct
   cs.SetId
from ProbablyChangedSets cs
join PatternSet s on
    cs.ElemId = s.ElemId

-- |cs| = setCount * avgSetSize = 10^8 * 10 = 10^9
-- |s|  = avgSetSize = 10
-- numberOfComparisons ~= 10^9 * 10 = 10^10, comparisonComplexity = O(1)

With enough parallelization it'd be very fast - a couple of seconds.

Or your checks are sequential and you need to optimize single check operation?

pkuderov
  • Nearly. I would prefer a solution where I do not have to create the M-N table between Set and Pattern on the one hand, and do not have to build up the huge join table on each request on the other hand, which is costly. (The content of PatternSet is different per query, so you mostly only have the ProbablyChangedSets table and cs.ElementId IN (A,B,C).) For checking a single element with a Bloom filter I would only have to store a bitset and check with "WHERE SetBitset & ElementBitset = ElementBitset", and I would love to have something similar for checking not only one element. – Peter V. Nov 20 '17 at 16:11
  • Creating the bitset, hash and so on needs the same `at least O(setSize)` operations as searching for every set's element in the pattern set. – pkuderov Nov 20 '17 at 16:17
  • I don't have to create the hash at query execution time. I can create it already when I create the set. The hash for the query needs to be created only once per query, so that would not be a performance problem. – Peter V. Nov 20 '17 at 16:29
  • ok, I see. Thanks! Then bucketing as @Danny_ds proposed in his answer looks like viable option, imo. – pkuderov Nov 20 '17 at 16:30