
Normally you write a query and get all the records (entities) that match it. I need to do the reverse.

Let's say I have 1M customers with a couple dozen denormalized properties:

public class Customer {
  public string Name {get;set;}
  public string Email {get;set;}
  public string Phone {get;set;}
  public DateTime Birthday {get;set;}
  public DateTime LastEmailed {get;set;}
  public DateTime LastCalled {get;set;}
  public int AgeInYears {get { return DateTime.UtcNow.Year - Birthday.Year;}}
  public int SalesTerritoryId {get;set;}
  // etc.

}

And I have 10k users that want to set up custom filters and be notified when any new customer matches the rules they defined.

Some of these rules are evaluated when the customer is created/updated (e.g.)

  • Customers with a phone number AND in my sales territory.
  • Customers with an email AND LastEmailed is NULL AND sales territory IN (1, 7, 11)

Other rules will run periodically (e.g.)

  • Customers with a birthday today.

On a daily basis there will be millions of saves to customers and 5-10k custom filters to be checked against each new/updated customer.

I realize I could use Expression Trees for the users' filters, but then I'd end up doing something like this:

public class CustomerRule : IRule {

  public bool IsMatch(Customer customer) {
    // Expression Tree Stuff
  }

  public void DoAction(Customer customer) {
    // Notification Stuff
  }
}

public class CustomerService {

  public void SaveOrUpdate(Customer customer) {
    IList<IRule> rules = GetRules();

    // this isn't going to handle 1M save/updates * 10k rules very well
    foreach (var rule in rules) {
      if (rule.IsMatch(customer)) {
        rule.DoAction(customer);
      }
    }
  }
}

I know others have solved this problem but I'm having a hard time figuring out what exactly to look for. General guidance is appreciated, specific patterns, code, tools, etc. even better. We primarily use C# but can go outside the .NET world if need be.

Kyle West
    My first thought goes to a way to decouple the rule checker engine from the CRUD service. Put any customer change event in a queue, and process that queue asynchronously using another service which will check for any rule match. That will scale without overloading the main service. – Federico Dipuma Apr 10 '17 at 15:41
  • That's the plan, but we still need to handle millions of rule checks/actions a day and there has to be a better way than running a huge foreach loop on each one. – Kyle West Apr 11 '17 at 13:45
  • I'm sorry, but I cannot think of anything apart from parallelizing the load (maybe also using `Parallel.ForEach`). If you *have* to check 10k rules, then you need to execute 10k operations, no less. Maybe another approach could be to reduce the number of checks by grouping similar rules from different users together (e.g. execute rule "IsCustomerMale" just once). – Federico Dipuma Apr 11 '17 at 13:58
  • Do you have some acceptable delay between event and notification? Say, must notifications be delivered immediately, or is a delay of up to X minutes fine? – Evk Apr 12 '17 at 15:50
  • Are the custom rules always `AND` conditions? Never `OR` conditions? – user1935361 Apr 12 '17 at 17:54
  • @Evk - it doesn't need to be realtime, but should be within a minute or two. – Kyle West Apr 12 '17 at 20:06
  • @user1935361 can be either. – Kyle West Apr 12 '17 at 20:06
  • Have you considered using the WWF rule engine? https://msdn.microsoft.com/en-us/library/dd554919.aspx It can be used outside workflow. Some considerations about its performances here: http://geekswithblogs.net/cyoung/articles/114597.aspx As far as I understand things, it implements the Rete algorithm. – David Brabant Apr 13 '17 at 05:28
  • About using the rule engine outside a workflow: https://cgeers.wordpress.com/2008/01/26/using-rules-outside-of-a-workflow/ – David Brabant Apr 13 '17 at 05:34
  • @DavidBrabant - thanks, I will look into that. – Kyle West Apr 13 '17 at 15:17
  • If you are using MSSQL for your database, have you considered using Triggers? You could load up all 'tasks' that need to be run in another table and have a cron that does all the notifications. You could even write a trigger that puts all the notifications into the database for you - then all you would have to do is send them. – Michael Coxon Apr 17 '17 at 10:11
  • In a generic repository, I have a check for certain fields, e.g. added/updated that automatically get updated if they exist using reflection. You could take this a step further and check for any observing rules placed on the customer and execute the checks on add/update. – Jonathan Walton Apr 19 '17 at 05:44

6 Answers


I'd like to make a different point than the other answers. You claim in your code that

// this isn't going to handle 1M save/updates * 10k rules very well

But have you actually verified this? Consider this code:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Linq.Expressions;
using System.Threading;
using System.Threading.Tasks;

public class Program {
    static List<Func<Customer, bool>> _rules = new List<Func<Customer, bool>>();
    static void Main(string[] args) {
        foreach (var i in Enumerable.Range(0, 10000)) {
            // generate simple expression, but joined with OR conditions because 
            // in this case (on random data) it will have to check them all
            // c => c.Name == ".." || c.Email == Y || c.LastEmailed > Z || territories.Contains(c.TerritoryID)

            var customer = Expression.Parameter(typeof(Customer), "c");
            var name = Expression.Constant(RandomString(10));
            var email = Expression.Constant(RandomString(12));
            var lastEmailed = Expression.Constant(DateTime.Now.AddYears(-20));
            var salesTerritories = Expression.Constant(Enumerable.Range(0, 5).Select(c => random.Next()).ToArray());
            var exp = Expression.OrElse(
                Expression.OrElse(
                    Expression.OrElse(
                        Expression.Equal(Expression.PropertyOrField(customer, "Name"), name),
                        Expression.Equal(Expression.PropertyOrField(customer, "Email"), email)),
                    Expression.GreaterThan(Expression.PropertyOrField(customer, "LastEmailed"), lastEmailed)),
                Expression.Call(typeof(Enumerable), "Contains", new Type[] { typeof(int) }, salesTerritories,
                    Expression.PropertyOrField(customer, "SalesTerritoryId")));
            // compile
            var l = Expression.Lambda<Func<Customer, bool>>(exp, customer).Compile();
            _rules.Add(l);
        }

        var customers = new List<Customer>();
        // generate 1M customers
        foreach (var i in Enumerable.Range(0, 1_000_000)) {
            var cust = new Customer();
            cust.Name = RandomString(10);
            cust.Email = RandomString(10);
            cust.Phone = RandomString(10);
            cust.Birthday = DateTime.Now.AddYears(random.Next(-70, -10));
            cust.LastEmailed = DateTime.Now.AddDays(random.Next(-70, -10));
            cust.LastCalled = DateTime.Now.AddYears(random.Next(-70, -10));
            cust.SalesTerritoryId = random.Next();
            customers.Add(cust);
        }
        Console.WriteLine($"Started. Customers {customers.Count}, rules: {_rules.Count}");
        int matches = 0;
        var w = Stopwatch.StartNew();
        // just loop
        Parallel.ForEach(customers, c => {
            foreach (var rule in _rules) {
                if (rule(c))
                    Interlocked.Increment(ref matches);
            }
        });
        w.Stop();
        Console.WriteLine($"matches {matches}, elapsed {w.ElapsedMilliseconds}ms");
        Console.ReadKey();
    }

    private static readonly Random random = new Random();
    public static string RandomString(int length)
    {
        const string chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
        return new string(Enumerable.Repeat(chars, length)
          .Select(s => s[random.Next(s.Length)]).ToArray());
    }
}

public class Customer {
    public string Name { get; set; }
    public string Email { get; set; }
    public string Phone { get; set; }
    public DateTime Birthday { get; set; }
    public DateTime LastEmailed { get; set; }
    public DateTime LastCalled { get; set; }

    public int AgeInYears
    {
        get { return DateTime.UtcNow.Year - Birthday.Year; }
    }

    public int SalesTerritoryId { get; set; }
}

Here I generate 10K rules in the form of expressions. They are simple, but not trivial: 4 conditions joined with OR, involving strings, dates, and Contains. Then I generate 1M customer updates (the number of customers in your database is irrelevant; we only work with updates) and just run a loop. Guess how long it takes on my regular (non-server) PC? 4 minutes.

So all your rules for all customer updates for the whole day can be checked in just 4 minutes (on a proper server it should be at least twice as fast, probably more). Checking a single update against 10K rules takes a few milliseconds. Given that, your bottleneck will most likely be somewhere else, not here. You can apply a couple of trivial optimizations on top of that, if you'd like:

  • Collapse identical rules. No need to check the "is birthday today" rule separately for every user.

  • Store the properties which are used in each rule, and also note which columns were updated in the Customer. Don't run rules which don't use any of the updated columns (see the sketch below).

But that might actually slow you down rather than speed things up, so everything should be measured.
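
As a minimal sketch of that second optimization, assuming a hypothetical RuleEntry type that records which properties each compiled rule reads (those sets could be collected by walking the expression with an ExpressionVisitor before compiling; all names here are illustrative):

using System;
using System.Collections.Generic;
using System.Linq;

public class RuleEntry {
    public Func<Customer, bool> Predicate;  // the compiled expression
    public HashSet<string> PropertiesUsed;  // e.g. { "Email", "SalesTerritoryId" }
}

public class RuleIndex {
    private readonly Dictionary<string, List<RuleEntry>> _byProperty =
        new Dictionary<string, List<RuleEntry>>();

    public void Add(RuleEntry rule) {
        // register the rule under every property it reads
        foreach (var prop in rule.PropertiesUsed) {
            if (!_byProperty.TryGetValue(prop, out var list))
                _byProperty[prop] = list = new List<RuleEntry>();
            list.Add(rule);
        }
    }

    // only rules that touch at least one changed property need to run
    public IEnumerable<RuleEntry> RulesFor(IEnumerable<string> changedProperties) {
        return changedProperties
            .Where(_byProperty.ContainsKey)
            .SelectMany(p => _byProperty[p])
            .Distinct();
    }
}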

Don't send notifications from the same code which does the rule check. Put them into a queue and let other processes/threads handle them. Checking rules is strictly CPU-bound work, while sending notifications (I assume, in your case) is IO-bound, so you might actually be able to do both on one machine, in one process. You also don't want to spam a given user with notifications at this rate; you will most likely send them in batches, at most one batch per minute I'd think, so this won't be too costly.
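
For example, a minimal in-process version of that hand-off using a BlockingCollection (in production this would more likely be an external queue; the Notification type here is illustrative):

using System.Collections.Concurrent;
using System.Threading.Tasks;

public class Notification {
    public int UserId;
    public string Message;
}

public static class NotificationPipeline {
    // rule-checking threads write here; a separate consumer does the IO
    private static readonly BlockingCollection<Notification> _queue =
        new BlockingCollection<Notification>();

    public static void Enqueue(Notification n) => _queue.Add(n);

    public static Task StartConsumer() => Task.Run(() => {
        foreach (var n in _queue.GetConsumingEnumerable()) {
            // sending is IO-bound and no longer blocks the CPU-bound rule checks
            Send(n);
        }
    });

    private static void Send(Notification n) { /* email/push/etc. goes here */ }
}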

As for the customer updates themselves, you might store them in some queue (like RabbitMQ), use database notifications (like PostgreSQL's pg_notify), or just poll the database every minute to get all updates for that period. Again, the performance of the different approaches should be measured.
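
A sketch of the simplest of those options, polling, assuming a hypothetical LastModified column that is maintained on every save (your schema may differ):

using System;
using System.Collections.Generic;
using System.Data.SqlClient;

public static class UpdatePoller {
    // returns ids of customers changed since the last poll
    public static List<int> GetUpdatedCustomerIds(string connectionString, DateTime since) {
        var ids = new List<int>();
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(
            "SELECT Id FROM Customer WHERE LastModified > @since", conn)) {
            cmd.Parameters.AddWithValue("@since", since);
            conn.Open();
            using (var reader = cmd.ExecuteReader())
                while (reader.Read())
                    ids.Add(reader.GetInt32(0));
        }
        return ids;
    }
}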

In addition to that, this kind of task is easily parallelized across multiple machines, so if you ever hit 100M customers, no problem: you can just add one more server (or maybe one will still be fine).

Evk
  • thanks for the code. unfortunately, 4 minutes isn't going to work. Users are expecting to be notified of new matching customers almost immediately. Some of the rules will be run on a schedule (e.g. birthday today) but most are intended to be executed as other systems add/update the customers. – Kyle West Apr 13 '17 at 15:15
  • @KyleWest but 4 minutes is for a whole batch of all updates and all rules per DAY (you said that you have 1 million updates per _day_). So if you waited all day and then wanted to notify users about all updates that happened, that would take 4 minutes. Obviously you won't do that; instead you will check updates as they come. Checking 10K rules against a single update takes a few milliseconds, so your users can be notified in almost real time. – Evk Apr 13 '17 at 15:22
  • @KyleWest Of course I assume you detect when a customer was inserted or updated and only run the rule check for that insert or update, not for the whole set of customers every time. I see my number is a bit confusing because you both have 1M customers and "millions" of updates per day. My example shows that handling 1M updates takes 4 minutes; the number of customers is irrelevant. – Evk Apr 13 '17 at 15:42
  • That's essentially what I was thinking about. I do not believe you could achieve better performance than this. – Federico Dipuma Apr 19 '17 at 08:24
  • @KyleWest if something is not clear yet, or you have some questions or doubts - feel free to ask. You've been quite quiet since your bounty started :) – Evk Apr 19 '17 at 13:56

Executing all of the filters in sequence each time a user makes a request will be difficult, if not impossible, to complete almost immediately.

What about setting up message queues and breaking the filters into separate execution tasks that you enqueue as users do their saves?

You can set up multiple queues for different types of filters (birthdays/location/industry/etc) and then have different workers watching the queues for changes. Execute the messages in the birthday queue once a day, execute the user creations and updates, etc, continuously, and put more workers against the heavier ones to process the messages faster. You can turn on more workers at peak times and turn some off at down times.

You can also split the workers by filter type or cost: have different workers for different kinds of filters, or for longer-running filters, and combine the results (or add/remove results as filters complete). They can run in parallel as the tasks come in, processing the different filters at the same time.

Store the results in a document database or cache them in a Redis server and pull the results from that.
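
A minimal in-process sketch of that routing idea, using one BlockingCollection per filter category and a configurable number of workers per category (the categories are illustrative; a real deployment would use an external broker such as RabbitMQ):

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public enum FilterCategory { Birthday, Location, Industry }

public static class FilterQueues {
    private static readonly Dictionary<FilterCategory, BlockingCollection<Customer>> _queues =
        new Dictionary<FilterCategory, BlockingCollection<Customer>> {
            [FilterCategory.Birthday] = new BlockingCollection<Customer>(),
            [FilterCategory.Location] = new BlockingCollection<Customer>(),
            [FilterCategory.Industry] = new BlockingCollection<Customer>(),
        };

    public static void Publish(FilterCategory category, Customer customer) =>
        _queues[category].Add(customer);

    // start N workers per category; heavier categories get more workers
    public static void StartWorkers(FilterCategory category, int count, Action<Customer> handle) {
        for (int i = 0; i < count; i++)
            Task.Run(() => {
                foreach (var customer in _queues[category].GetConsumingEnumerable())
                    handle(customer);
            });
    }
}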

Austin Winstanley

The essential question is:

How do you define and store your custom filters (the rules)?

You mention '5-10k custom filters to be checked'. If the number is that big, you probably have some flexible structure for the rules, like

<field> <operator> <value> (e.g. <LastEmailed> <is> <NULL>)

with all the variety lying in the values of <field>, <operator> and <value>.

If so, then for a new/updated customer you can select all the rules that its data satisfies. This could be done either by a single query or by a stored procedure with some level of complexity. It really depends on the design of your database.

My main point is: if your rules are stored in your database, then you can check with pure SQL whether some data meets a rule.

Such a check against ~10k rules should not cost too much from a performance perspective. Again, it really depends on the structure of your DB and the size of the tables that must be joined to 'compile' and check each rule.
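
A sketch of what such a check could look like, assuming a hypothetical Rules table with one row per condition (RuleId, Field, Operator, Value) and AND semantics within a rule; the schema, field names, and covered operators are all illustrative:

using System;
using System.Collections.Generic;
using System.Data.SqlClient;

public static class SqlRuleMatcher {
    // a rule matches when every one of its condition rows matches the customer
    public static List<int> MatchingRuleIds(string connectionString, Customer c) {
        const string sql = @"
            SELECT RuleId
            FROM Rules
            GROUP BY RuleId
            HAVING COUNT(*) = SUM(CASE
                WHEN Field = 'Phone'            AND Operator = 'IS NOT NULL' AND @Phone IS NOT NULL   THEN 1
                WHEN Field = 'Email'            AND Operator = 'IS NOT NULL' AND @Email IS NOT NULL   THEN 1
                WHEN Field = 'SalesTerritoryId' AND Operator = '='           AND Value = @TerritoryId THEN 1
                ELSE 0 END)";
        var ids = new List<int>();
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn)) {
            cmd.Parameters.AddWithValue("@Phone", (object)c.Phone ?? DBNull.Value);
            cmd.Parameters.AddWithValue("@Email", (object)c.Email ?? DBNull.Value);
            cmd.Parameters.AddWithValue("@TerritoryId", c.SalesTerritoryId.ToString());
            conn.Open();
            using (var reader = cmd.ExecuteReader())
                while (reader.Read())
                    ids.Add(reader.GetInt32(0));
        }
        return ids;
    }
}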

Of course, it could happen that you have some limited set of rules which are complex enough that they can only be checked from .NET code. It's OK to have a foreach loop for those, as you posted, as long as the number of such rules isn't large.

I agree with Federico Dipuma that asynchronous processing is an option. However, it should be your second choice if the above approach doesn't work. It's more likely that you'll choose an asynchronous approach for performing the actions on matched rules, because such operations are usually very time consuming (e.g. sending email or other notifications, INSERTs or UPDATEs in the database, etc.).

CodeFuller
  • all the rules follow that template (`<field> <operator> <value>`), however, they can also be combined to create more complex rules -- think creating a smart playlist in iTunes: rating is over 3 AND last played not in the past 30 days AND genre in (rock, jazz). – Kyle West Apr 12 '17 at 20:10
  • We are not doing any calculations that require .NET code, so this could be contained in the SQL database. – Kyle West Apr 12 '17 at 20:12
  • Once you get the list of simple rules that match the new/updated record, you can easily find which complex ones match. You can either evaluate only the complex rules that contain one of the matched simple rules, or go as far as creating a bitmap from the matched simple rules and comparing it to bitmaps created from the complex rules by decomposing them into a "sum of products" (I admit the latter needs further thought, but it is merely a pointer in a different direction that might help). – P. Kouvarakis Apr 17 '17 at 21:41

With 1M updates and 10k rules you need to reduce the number of rules to be checked. Since you only have a couple dozen properties, those should be your selection criteria for which rules to run: first filter the rules based on which properties appear in each rule, and compare that against which properties were updated.

  • Add a SearchParameters field to the rule, and give it the value 010405 if the rule only contains parameters 01 (name), 04 (birthday) and 05 (lastemailed).
  • Store the SearchParameters (and a link to the rule) in a separate table, sorted in ascending order.
  • When a user updates their record, get the numbers of the updated parameters, e.g. 02, 06 and 07.
  • Then, in the list of SearchParameters, find all values (and the corresponding rule links) containing any of the updated parameters. Since this is an ordered list, this can be done very efficiently.
  • Now you have a reduced list containing only the rules that use at least one of the changed parameters. This reduced list is what you check in your foreach loop.

I hope the idea is clear. Here are some different/better implementation options.

I think a more efficient implementation can be done with a 2D boolean array, where each row is a rule and each column is a parameter. So something like this:

rules  | param1 | param2 | param3 | ...
rule1  |   0    |   1    |   0    | ...
rule2  |   1    |   0    |   1    | ...
rule3  |   1    |   1    |   1    | ...

Then, upon an update, just take the column of the appropriate parameter and select all the rules where that parameter is 1.
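
In C#, one compact way to express that boolean matrix is a bitmask per rule; a sketch with illustrative flag names:

using System;
using System.Collections.Generic;
using System.Linq;

[Flags]
public enum CustomerField {
    None = 0,
    Name = 1,
    Email = 2,
    Phone = 4,
    Birthday = 8,
    LastEmailed = 16,
    SalesTerritoryId = 32,
}

public class IndexedRule {
    public int RuleId;
    public CustomerField FieldsUsed;  // the rule's "row" of the boolean matrix
}

public static class RuleMatrix {
    // select only the rules whose mask overlaps the updated fields
    public static IEnumerable<IndexedRule> Applicable(IEnumerable<IndexedRule> rules, CustomerField updated) {
        return rules.Where(r => (r.FieldsUsed & updated) != CustomerField.None);
    }
}

For example, after an update that touched email and phone you would call Applicable(rules, CustomerField.Email | CustomerField.Phone).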

Another option (I think the best and fastest) is to base it completely on SQL. The basic idea remains roughly the same, except that the rules are stored as SQL in the rules table, so you get the following table:

rule_table
ruleNr  | param1 | param2 | param3 | rule
   1    |   0    |   1    |   0    | SELECT recordID FROM Customer WHERE name LIKE 'Will%' AND location = US; 
   2    |   1    |   0    |   1    | SELECT recordID FROM Customer WHERE name = 'West' AND ...;
   3    |   1    |   1    |   1    | SELECT recordID FROM Customer WHERE ...;

Upon an update or creation of a customer, run the following query to select all the rules containing one of the updated parameters (every updated parameter should appear in the query):

  SELECT rule FROM rule_table WHERE param1 = 1 OR param4 = 1 OR ....

This query gives a list of applicable SQL rules, which should already be formatted in the correct way. Loop through each SQL query and process the results; the result of each stored query is a list of recordIDs pointing to the matching customer records.
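
A sketch of that loop, assuming each rule row stores a complete SELECT as in the table above (note that executing SQL stored in a table requires that you trust its source):

using System.Collections.Generic;
using System.Data.SqlClient;

public static class SqlRuleRunner {
    // runs each applicable rule's stored SQL and collects the matching record ids
    public static Dictionary<string, List<int>> Run(string connectionString, IEnumerable<string> ruleQueries) {
        var matches = new Dictionary<string, List<int>>();
        using (var conn = new SqlConnection(connectionString)) {
            conn.Open();
            foreach (var sql in ruleQueries) {
                var ids = new List<int>();
                using (var cmd = new SqlCommand(sql, conn))
                using (var reader = cmd.ExecuteReader())
                    while (reader.Read())
                        ids.Add(reader.GetInt32(0));
                matches[sql] = ids;
            }
        }
        return matches;
    }
}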

Hope this helps a bit.

D.J. Klomp

You definitely don't want to delay saving the record to the database in order to run rules. Any error occurring within IsMatch() or DoAction() could potentially abort the save. I would assume an alert that it is someone's birthday is not nearly as vital as actually adding the person to the database.

I would add the add/update event to a queuing system. Now, don't go thinking of a queuing system as a place for things to stack up and wait for long periods of time! The Windows OS is itself a queuing system; it uses message queues for pretty much everything. So the CustomerService.SaveOrUpdate method you posted would send an event (or message; I find it easier to think of it as an event) to your "UpdatedUser" queue. This queue would have one or more listeners waiting for events to show up. They would then take the event, find any rules that match its data, and perform the appropriate actions.

The beauty of using a queuing system is that you can offload the processing to a dedicated machine and not bog down the system that is responsible for getting data saved into your data stores. The queue listener in charge of processing the rules can load the rules into memory, which allows it to find which rules apply much faster than loading them from a database for each of the tens of thousands of updates daily. I would venture to say that GetRules() is a fairly intensive process, as it probably reads the generated rules from a database and converts each into an Expression Tree object. Having a dedicated rules engine that listens to a queue for things to apply its rules against would be quicker!

One of the best things about the queue/listener approach is that it is very extensible. If the queue ever starts backing up and your rules engine just can't keep up, you have options! The fastest/easiest way to keep the queue short: start up another rules engine that listens to the same queue. That's right: you can have multiple listeners on a queue, and depending on how you set things up, you can ensure that a message is delivered to one and only one listener.

Another bonus: when you need to update the rules engine, you can take the existing one out of service, replace it, and start up the new code base. There's no fear of missing anything; the queue keeps accumulating events, and when the new code starts it will pick up processing these events.

Queues/listeners are pretty easy to set up and test out. I've used MSMQ for a couple of my Microsoft-stack solutions. I have also used ActiveMQ for a Java-based solution.

So, combining this with what Evk said: your solution with Expression Trees is not slow, at least once the rules are in memory. While on that subject, you would want these "rules in memory" to refresh periodically. You could use a set period, like every 15 minutes, or go more elaborate and fire an event when a rule's SaveOrUpdate method is called. I might opt for the event firing, but this all depends on the business need.

You could also skip the queue method and just create a service, say WCF, that accepts the customer data and processes the rules. If your alerts are fired within the client that is saving the data, it could wait for the reply, or you could use a duplex service where the service pushes alerts to the client. The only downside to this approach is that there would be just one specific service in use, so you could not double the throughput by simply firing up a second service. You can add the ability to push notifications to the clients from the queue/listener; it's just a bit more work.

Anyway, long story short - too late! There are options that would make your current Expression Tree implementation very viable. I personally think you are on the right track with it. I get the impression that your end-users need to create and maintain these rules, so the rules can't be too rigid; any sort of grouping/binary scheme to rapidly dismiss large groups of rules may therefore not be an option. You could end up spending more time managing rule groups than you ever save in execution speed.

I guess I have a lot to say on this and really no code examples, as you would need to choose a queue technology and probably just walk through its "getting started" docs.

Good luck

Larry Dukek