
We are working on a solution that fires many search requests toward three different public databases located in three different countries. For example, a search fetches data from one database and passes the result as a parameter to another database. The parameter is a list in which each item needs to be logically connected with an OR operator, so we end up with a SQL select statement containing up to 1000 OR operators in the where clause.

Now my question is: do 500, 1000, or even 5000 logical AND or OR operators inside a select statement make the database slower, and should I instead request all the data and do the matching on my PC?

The amount of data is between 5000 and 10000 records, and since we are talking about a public database, the amount keeps growing.

For example such a sql statement:

select * from some_table
where .. and .. or .. or.. or..
      or.. or.. or.. or.. or.. or.. (1000 times)

If I fetched all the data to my PC, I could use a LINQ statement to do the filtering.

What do you suggest I do? Does anyone have experience with this?

Sorry if this is a duplicate; just let me know in the comments and I'll delete this question.

EDIT:

It should be considered that many users may access the databases at the same time.

dev hedgehog
  • If you can manage it, try to bulk-load the ids into a temp table that you can just join to (most DBs have utilities for this). This will certainly allow for _cleaner_ queries, and might be faster than concatenating values for a dynamic query... – Clockwork-Muse Mar 25 '14 at 14:55
  • Thousands of `OR` boolean results are not as intimidating as they may seem. For example, if you represent `TRUE` as 1 and `FALSE` as 0 in a series of OR statements, the sum of all the several thousand boolean results translated this way just needs to be greater than 0 to meet the criteria. – Richard Pascual Mar 26 '14 at 07:50
  • @Pascual What do you mean? How could I tell the select statement to use 1 and 0? :) – dev hedgehog Mar 26 '14 at 07:55
  • @devhedgehog, this is more of an aside to the original post, but I'd be glad to explain. I set up a mock query on my Github account. Take a look at [this example](https://gist.github.com/richardpascual/9778911) and let me know what you think. – Richard Pascual Mar 26 '14 at 08:37
  • If you can have 1000 cases and it runs fast I buy it. :) :) :) – dev hedgehog Mar 26 '14 at 09:29

5 Answers


I always learned that running a query with hundreds of OR conditions is bad for performance. However, even when running a sample here on 12c, querying a table with `OR` or `IN` against a primary key index doesn't seem to change the execution plan.

Therefore I say: it doesn't matter. The only things you could consider are readability, query length, etc.

Still, I personally prefer `WHERE ... IN`.
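As a minimal illustration (the table and column names below are placeholders, not from the question), here is the same filter written once as an OR chain and once as an IN list; most engines treat the two forms the same internally, but the IN form is far easier to read and to generate:

 -- a long OR chain on a key column (placeholder names, for illustration only)
 select * from some_table
  where id = 101 or id = 102 or id = 103;   -- ...repeated for every value

 -- the same predicate written as an IN list
 select * from some_table
  where id in (101, 102, 103);              -- ...one entry per value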

See this other useful question with sample data.

Patrick Hofman
  • I do not think it's weird, since I need data from one db for the select inside another. The databases need to play together. Do you have any facts on how bad the performance is with so many OR operators? – dev hedgehog Mar 25 '14 at 08:13
  • Maybe this reference is useful: http://stackoverflow.com/a/16608145/993547. There is also some test data in another answer. – Patrick Hofman Mar 25 '14 at 09:42
  • Basically that reference tells me that IN will be translated to OR operators internally. Still, it's not helping me much in deciding which approach to take. – dev hedgehog Mar 25 '14 at 11:36
  • So the answer is, I should do everything in the database and send the exact result over the network to my PC? – dev hedgehog Mar 25 '14 at 13:52
  • @devhedgehog: In my opinion, data filtering is something databases are built for, nothing else has that performance. – Patrick Hofman Mar 25 '14 at 13:53
  • About the `WHERE IN` approach: this may work where the number of key values delivered in the IN-list is limited and sent in controlled batches (see the sketch after these comments). I collaborated with a group of Perl developers on an Oracle 10g DB using the IN-list approach. It was easy enough for their scripts to generate dynamic SQL with delimited lists of IDs for the in-list operator. We learned that `ORACLE` in this version chokes on in-lists at around 5,000 elements. – Richard Pascual Mar 26 '14 at 07:54
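One detail worth adding to the batching point above (my note, not part of the original comment): Oracle also caps a single IN list at 1000 literal expressions (ORA-01795), so generated queries with long id lists are usually split across several IN predicates, for example:

 -- placeholder names; each IN list stays at or below Oracle's 1000-expression limit
 select * from some_table
  where id in (101, 102, 103)       -- ...first batch of at most 1000 ids
     or id in (1101, 1102, 1103);   -- ...next batch of at most 1000 ids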

Process this all in the database with a single query. Batching similar operations is usually the best thing you can do for database performance.

The most expensive part of the query is reading the data from disk. Once the data is in memory, filtering out a few thousand conditions is a small amount of work. Your local processor is probably faster than the database server, but it doesn't matter, because your machine would spend too much time on unnecessary I/O if you returned all the records.

Also, 5000 conditions in a SQL query is only a problem if you run that query a hundred times a second.
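If the target database lets you create session-scoped objects, a common way to keep all of this in one set-based query (also suggested in the comments under the question) is to bulk-load the id list into a temporary table and join to it instead of generating a huge OR chain. A rough Oracle-flavoured sketch; `search_ids`, `some_table`, and the `id` column are assumed names, not from the question:

 -- session-private staging table for the id list (created once)
 CREATE GLOBAL TEMPORARY TABLE search_ids (
     id NUMBER(10,0) PRIMARY KEY
 ) ON COMMIT PRESERVE ROWS
 /

 -- bulk-insert the ids fetched from the first database, then join:
 INSERT INTO search_ids (id) VALUES (101);
 INSERT INTO search_ids (id) VALUES (102);

 SELECT t.*
   FROM some_table t
   JOIN search_ids s ON s.id = t.id;   -- the filtering stays inside the db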

Jon Heller

I think you should just try it.

Create an example that is as simple as possible, yet complex enough to be realistic, and then run it with some form of benchmarking.

Whatever works best for you is what you should choose to do.

Edit:

That said - such a large number of ANDs and ORs in a single SQL statement does sound complicated and messy. Unless there is a real benefit from doing it this way(?), I would probably try to find a cleaner way to do it, for instance by splitting the operation into several steps and applying LINQ or something similar, as you suggest, even if it is just to make the solution more manageable.

Kjartan
  • That is what I thought too, but there is no time to try it out. I need opinions and experiences on this one. Right now I am in no position to take three days and come up with a few test cases and code them out. I was hoping there might be clear rules to apply in such cases. I am asking you guys. – dev hedgehog Mar 25 '14 at 08:10
  • Ok. Seems a little risky though, as it's hard to know how different strategies will work in your particular case, since it may depend on a lot of factors, from the data you are fetching to the hardware it is running on. Make use of the answers here, then, but I would still recommend making at least a minimal proof of concept to test it, even if it is quite simple and small - just to verify that it is not completely off the mark before you go ahead and implement something more advanced. – Kjartan Mar 25 '14 at 08:13
  • I am thinking right now that even when I fetch everything to my PC, I would still need to do the filtering, and that will take time. Filtering will also take time if done in the database, so to me it is more or less a fixed cost either way. However, if I let the database do the filtering, I would need to transfer less data to my PC, and there I see a performance gain. That is just what I am thinking right now. Am I wrong? Have you ever had to do something like this? – dev hedgehog Mar 25 '14 at 08:19
  • Hi @devhedgehog, you are on the right track with your latest thinking about this situation. Two key concepts: Reduce I/O and the amount of data pushed between the DB and the workstation/end user. Filtering through data isn't really a fixed constant. Workstations are fairly poor at executing filtering tasks compared to a database server engine. – Richard Pascual Mar 25 '14 at 12:21
  • Workstations are PCs, and a database server engine is always stronger than a workstation. Well, at least usually. Right? :) – dev hedgehog Mar 25 '14 at 13:51
  • Yes, for a developer, PCs and database servers are sometimes one and the same. At that point, a developer doesn't see issues with scalability until the whole solution gets deployed. You have to think a little further ahead in that respect. See my answer below for a few more comments on my observations. – Richard Pascual Mar 25 '14 at 17:02

The answer is: it depends.

  1. How big is the data on the public db? If you are querying Google, then fetching all the data is not an option.

  2. It is reasonable to assume that those public dbs have much stronger hardware and better db tuning than your home PC.

  3. Is there a chance that you will get blacklisted by those public dbs?

  4. Does order matter? Will querying db 1 and then db 2 be faster than querying db 2 and then db 1?

Mostly it's trial and error: go with whatever works best for you and is feasible.

Mzf
  • It's pretty big, about 10,000 records right now, but it will probably grow. I think the hardware is normal, I guess; I don't know about the hardware. Order does matter: one db holds info about persons, another about activities. A search would be like: find all activities associated with persons from the city of London. I need to go to the person db first and get everyone from London, and then I go to the second db to find their activities. – dev hedgehog Mar 25 '14 at 08:15

SQL Queries on ORACLE Using Multiple Boolean Operators

Comments: I have worked many years with CRYSTAL REPORTS, a database report designer. It was one of the first drag-and-drop, GUI-based tools which made it easier for developers without much database background to construct queries with multiple tables and filter conditions. The trade-off was that the tool was writing SQL under the hood; many times it was a serious performance hog, because the workstation running the report file had to suck down the entire contents of the database tables being queried, only to run the filtering process locally on the client system. That was more than a decade ago, but I see other next-gen tools that also auto-generate really awful SQL code.

No amount of software can compensate for lousy database design. You won't get everything right the first time (as others have noted), but a little planning can give you some breathing room when real-world use reveals the product's true demands for PERFORMANCE and SCALABILITY.

Demonstration Schema and Test Data

The following solution was designed on an ORACLE 11g Release 2 RDBMS system. The first table can be represented by a database VIEW, INLINE QUERY, SUB QUERY, MATERIALIZED VIEW or even a CURSOR output, so the "attributes" discussed in this example could be coming from multiple table sources and joining criteria.

 CREATE TABLE  "ZZ_DATA_ATTRIBUTES" 
    (   "DATA_ID" NUMBER(10,0) NOT NULL ENABLE, 
        "NAME" VARCHAR2(50), 
        "AGE" NUMBER(5,0), 
        "HH_SIZE" NUMBER(5,0), 
        "SURVEY_SCORE" NUMBER(5,0), 
        "DMA_REGION" VARCHAR2(100), 
        "LAST_CONTACT" DATE, 
        CONSTRAINT "ZZ_DATA_ATTRIBUTES_PK" PRIMARY KEY ("DATA_ID") ENABLE
    )
 /

 CREATE SEQUENCE   "ZZ_DATA_ATTRIBUTES_SEQ"  MINVALUE 1 MAXVALUE
 9999999999999999999999999999 INCREMENT BY 1 START WITH 41 CACHE 20 NOORDER  NOCYCLE
 /

 CREATE OR REPLACE TRIGGER  "BI_ZZ_DATA_ATTRIBUTES" 
   before insert on "ZZ_DATA_ATTRIBUTES"               
   for each row  
 begin   
   if :NEW."DATA_ID" is null then 
     select "ZZ_DATA_ATTRIBUTES_SEQ".nextval into :NEW."DATA_ID" from sys.dual; 
   end if; 
 end; 

 /
 ALTER TRIGGER  "BI_ZZ_DATA_ATTRIBUTES" ENABLE
 /

The SEQUENCE and TRIGGER objects are just for unique, auto-incremented values for the primary key on each table.

 CREATE TABLE  "ZZ_CONDITION_RESULTS" 
    (   "RESULT_ID" NUMBER(10,0) NOT NULL ENABLE, 
        "DATA_ID" NUMBER(10,0) NOT NULL ENABLE, 
        "COND_ONE" NUMBER(10,0), 
        "COND_TWO" NUMBER(10,0), 
        "COND_THREE" NUMBER(10,0), 
        "COND_FOUR" NUMBER(10,0), 
        "COND_FIVE" NUMBER(10,0), 
        CONSTRAINT "ZZ_CONDITION_RESULTS_PK" PRIMARY KEY ("RESULT_ID") ENABLE
    )
 /

 ALTER TABLE  "ZZ_CONDITION_RESULTS" ADD CONSTRAINT "ZZ_CONDITION_RESULTS_FK" 
 FOREIGN KEY ("DATA_ID") REFERENCES  "ZZ_DATA_ATTRIBUTES" ("DATA_ID") ENABLE
 /

 CREATE SEQUENCE   "ZZ_CONDITION_RESULTS_SEQ"  MINVALUE 1 MAXVALUE
 9999999999999999999999999999 INCREMENT BY 1 START WITH 1 CACHE 20 NOORDER  NOCYCLE
 /

 CREATE OR REPLACE TRIGGER  "BI_ZZ_CONDITION_RESULTS" 
   before insert on "ZZ_CONDITION_RESULTS"               
   for each row  
 begin   
   if :NEW."RESULT_ID" is null then 
     select "ZZ_CONDITION_RESULTS_SEQ".nextval into :NEW."RESULT_ID" from sys.dual; 
   end if; 
 end; 

 /
 ALTER TRIGGER  "BI_ZZ_CONDITION_RESULTS" ENABLE
 /

The table ZZ_CONDITION_RESULTS should be an actual TABLE. It will contain the results of each individual boolean OR criterion. While thousands of columns may not be practically feasible, the initial approach shows how you can line up lots of boolean outputs and quickly identify and isolate the combinations and patterns of interest.

Sample Data

You can pick your own data values, but these were created to make the examples work. I chose the theme of MARKETING, where the data pulled together represents different attributes our fictional company has gathered about its customers: customer name, age, hh_size (household size), the scoring results of some benchmarked survey, DMA (Demographic Marketing Area) region, and the date the customer was last contacted.

(screenshot: Test Marketing Data)

Defined Boolean Arguments Using an Oracle Package Structure

The initial design is to calculate the business logic through an Oracle PL/SQL Package Object. For example, in the OP:

 select * from some_table
 where .. and .. or .. or.. or..
  or.. or.. or.. or.. or.. or.. (1000 times)

Each blank is a separate Oracle function call from within the package(s). The result is represented as a column value for each record of attributes that are evaluated.

 create or replace package ZZ_PKG_MARKETING_DEMO as

 c_result_true   constant   pls_integer:= 1;
 c_result_false  constant   pls_integer:= 0;

 cursor attrib_cur is
    select data_id, name, age, hh_size, survey_score, dma_region,
       last_contact
       from zz_data_attributes;

 TYPE attrib_record_type IS RECORD (  
    data_id   zz_data_attributes.data_id%TYPE,
    name      zz_data_attributes.name%TYPE,
    age       zz_data_attributes.age%TYPE,
    hh_size   zz_data_attributes.hh_size%TYPE,
    survey_score   zz_data_attributes.survey_score%TYPE,
    dma_region     zz_data_attributes.dma_region%TYPE,
    last_contact   zz_data_attributes.last_contact%TYPE
    );

 function evaluate_cond_one (
    p_attrib_rec   attrib_record_type) return pls_integer;

 function evaluate_cond_two (
    p_attrib_rec   attrib_record_type) return pls_integer;

 function evaluate_cond_three (
    p_attrib_rec   attrib_record_type) return pls_integer;

 function evaluate_cond_four (
    p_attrib_rec   attrib_record_type) return pls_integer;

 function evaluate_cond_five (
    p_attrib_rec   attrib_record_type) return pls_integer;

 procedure main_driver;

 end;



 create or replace package body "ZZ_PKG_MARKETING_DEMO" is

 function evaluate_cond_one (
    p_attrib_rec   attrib_record_type) return pls_integer
 as
 begin
  -- Checks if person is from a DMA Region in California.
   IF p_attrib_rec.dma_region like 'CA%'
     THEN return c_result_true;
     ELSE return c_result_false;
     END IF;

 end EVALUATE_COND_ONE;


 function evaluate_cond_two (
    p_attrib_rec   attrib_record_type) return pls_integer
 as

 c_begin_age_range   constant   zz_data_attributes.age%TYPE:= 20;
 c_end_age_range     constant   zz_data_attributes.age%TYPE:= 35;

 begin
 -- Part 1 of 2 Checks if person belongs to the 20 to 35 years age bracket
  IF p_attrib_rec.age between c_begin_age_range and c_end_age_range
     THEN return c_result_true;
     ELSE return c_result_false;
     END IF;

 end EVALUATE_COND_TWO;


 function evaluate_cond_three (
    p_attrib_rec   attrib_record_type) return pls_integer
 as

 c_lowest_age   constant   zz_data_attributes.age%TYPE:= 45;

 begin
 -- Part 2 of 2 Checks if person is from age 45 and up demographic.
  IF p_attrib_rec.age >= c_lowest_age
     THEN return c_result_true;
     ELSE return c_result_false;
     END IF;

 end EVALUATE_COND_THREE;


 function evaluate_cond_four (
    p_attrib_rec   attrib_record_type) return pls_integer
 as

 c_cutoff_score  CONSTANT  zz_data_attributes.survey_score%TYPE:= 1200;

 begin
 -- Checks if person's survey score is higher than c_cutoff_score
  IF p_attrib_rec.survey_score >= c_cutoff_score
     THEN return c_result_true;
     ELSE return c_result_false;
     END IF;

 end EVALUATE_COND_FOUR;


 function evaluate_cond_five (
    p_attrib_rec   attrib_record_type) return pls_integer
 as

 c_last_contact_period  CONSTANT  pls_integer:= -750;

 -- Note current date is anchored to a static value so the data output
 --   in this example will still work regardless of how old this post
 --   may get.

 c_current_date   CONSTANT   zz_data_attributes.last_contact%TYPE:=
   to_date('03/25/2014','MM/DD/YYYY');

 begin
 -- Checks if person's last contact date has been in the last 750
 -- days.
  IF p_attrib_rec.last_contact >= 
        (c_current_date + c_last_contact_period)
     THEN return c_result_true;
     ELSE return c_result_false;
     END IF;

 end EVALUATE_COND_FIVE;


 procedure MAIN_DRIVER
 as

     v_rec_attr   attrib_record_type;
     v_rec_cond   zz_condition_results%ROWTYPE;

 begin
    for i in attrib_cur
    loop
    -- Set the input record variable with the attribute values queried by the
    -- current cursor.

        v_rec_attr.data_id := i.data_id;
        v_rec_attr.name    := i.name;
        v_rec_attr.age     := i.age;
        v_rec_attr.hh_size := i.hh_size;
        v_rec_attr.survey_score := i.survey_score;
        v_rec_attr.dma_region := i.dma_region;
        v_rec_attr.last_contact := i.last_contact;

    -- Set each condition column value equal to their matching package function.

        v_rec_cond.cond_one  := evaluate_cond_one(p_attrib_rec => v_rec_attr);
        v_rec_cond.cond_two  := evaluate_cond_two(p_attrib_rec => v_rec_attr);
        v_rec_cond.cond_three:= evaluate_cond_three(p_attrib_rec => v_rec_attr);
        v_rec_cond.cond_four := evaluate_cond_four(p_attrib_rec => v_rec_attr);
        v_rec_cond.cond_five := evaluate_cond_five(p_attrib_rec => v_rec_attr);

        INSERT INTO zz_condition_results (data_id, cond_one, cond_two,
           cond_three, cond_four, cond_five)
        VALUES 
           ( v_rec_attr.data_id,
             v_rec_cond.cond_one,
             v_rec_cond.cond_two,
             v_rec_cond.cond_three,
             v_rec_cond.cond_four,
             v_rec_cond.cond_five );

    end loop;
    COMMIT;

end MAIN_DRIVER;

 end "ZZ_PKG_MARKETING_DEMO";

PL/SQL Notes: Some may not be familiar with CUSTOM DATA TYPES such as the RECORD VARIABLE TYPE defined within the package and used in procedure MAIN_DRIVER. They make the data being processed easier to handle and reference.

Boolean Arithmetic in Plain English (well, sort of)

The CURSOR Named ATTRIB_CUR can be modified to operate on a single record or a smaller input data set. For now, invoke the MAIN_DRIVER procedure to process all the records in the attributes data source (again, this doesn't have to be a single table).

 BEGIN
    ZZ_PKG_MARKETING_DEMO.MAIN_DRIVER;
 END;

Now that each example condition has been evaluated for all the sample records, there are several simpler pathways to evaluating the boolean values, currently captured as values of "1" (for TRUE) and "0" (for FALSE).

If only one condition in this series needs to be met (as in a long chain of OR operators), then the WHERE clause should look something like this:

 WHERE COND_ONE = 1 OR COND_TWO = 1 OR COND_THREE = 1 OR COND_FOUR = 1 OR COND_FIVE = 1

A shorthand approach could be:

 WHERE (COND_ONE + COND_TWO + COND_THREE + COND_FOUR + COND_FIVE) > 0

What does this buy? There are performance gains from processing an otherwise static evaluation (the custom conditions) at the time the data record is populated. One good reason is that each subsequent query that asks about these criteria will not need to crunch through the business logic again. We also gain an advantage from a decision value with a very, very, very low cardinality (TWO!).

The second "shorthand" example of the WHERE filter criteria is a clue about how the final approach will manage "thousands" of Boolean evaluations.

Scalability: How to Do This Several Thousand More Times in a Row

It would be impractical to assume this approach could scale up to the magnitude presented in the OP. The final question: how can this solution apply to a chain of N thousand boolean values?

Hint: PIVOT your results.

Expandable Table Design for Lots of Boolean Conditions

(screenshot: Pivoted Table Design)

Here is also a mock-up of the table with the way the sample data would fit into it:

(screenshot: Organization of Data From Multiple Boolean Conditions)
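The pivoted layout itself only survives here as the screenshot captions above, so the following is a rough sketch of what such a table could look like; the column names are assumptions inferred from the aggregation query below, not the original design:

 CREATE TABLE  "ZZ_CONDITION_PIVOT" 
    (   "DATA_ID" NUMBER(10,0) NOT NULL ENABLE, 
        "COND_ID" NUMBER(10,0) NOT NULL ENABLE,   -- identifies which boolean condition the row represents
        "RESULT"  NUMBER(1,0)  NOT NULL ENABLE,   -- 1 = TRUE, 0 = FALSE
        CONSTRAINT "ZZ_CONDITION_PIVOT_PK" PRIMARY KEY ("DATA_ID", "COND_ID") ENABLE
    )
 /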

The SQL needed to fetch a multiple OR relation between the five sample conditions can be accomplished through an aggregation query:

 -- For multiple OR relations:

 SELECT DATA_ID
   FROM ZZ_CONDITION_PIVOT
  GROUP BY DATA_ID
 HAVING SUM(RESULT) > 0

Veterans will probably note this syntax can be further simplified with the use of database supported ANALYTICAL FUNCTIONS.
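As a rough illustration of that remark (my sketch, not from the original answer), the same OR filter can be written with an analytic SUM so that each row carries its per-DATA_ID total:

 SELECT DISTINCT DATA_ID
   FROM (
         SELECT DATA_ID,
                SUM(RESULT) OVER (PARTITION BY DATA_ID) AS TOTAL_TRUE
           FROM ZZ_CONDITION_PIVOT
        )
  WHERE TOTAL_TRUE > 0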

This design should be low maintenance with any number of boolean conditions introduced during or after the implementation. The table designs should remain the same throughout.
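One additional sketch of my own (not part of the original answer): the same pivoted table also covers the case where every condition must be met, i.e. a long AND chain, by requiring each stored result for a DATA_ID to be TRUE:

 -- For multiple AND relations: every condition row must equal 1.

 SELECT DATA_ID
   FROM ZZ_CONDITION_PIVOT
  GROUP BY DATA_ID
 HAVING MIN(RESULT) = 1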

Let me know your thoughts; it looks like the discussion has moved on to other issues and contributors, so this is probably long enough to get you started. Onward!

Richard Pascual