
From my other question, Using REGEXP_SUBSTR with Strings Qualifier, I'm trying to decide which approach would be better to use.

The resulting data set should only show the strings before the delimiters PLE, # and ALL, in proper order. The current query already in the package is something like this (the DDL and DML are at the bottom of the post):

SELECT  DATA1
      , DECODE(SIGN(0 - instr(DATA1, 'PLE')), -1, SUBSTR(DATA1, 1, instr(DATA1, 'PLE') - 1)) GET_DATA_TILL_FIRST_PLE
      , DECODE(SIGN(0 - instr(DATA1, '#')), -1, SUBSTR(DATA1, 1, instr(DATA1, '#') - 1)) GET_DATA_TILL_FIRST_NUM_SIGN
      , DECODE(SIGN(0 - instr(DATA1, 'ALL')), -1, SUBSTR(DATA1, 1, instr(DATA1, 'ALL') - 1)) GET_DATA_TILL_FIRST_ALL
      , NVL(DECODE(SIGN(0 - instr(DATA1, 'PLE')), -1, SUBSTR(DATA1, 1, instr(DATA1, 'PLE') - 1), 0,
        DECODE(SIGN(0 - instr(DATA1, '#')), -1, SUBSTR(DATA1, 1, instr(DATA1, '#') - 1), 0,
        DECODE(SIGN(0 - instr(DATA1, 'ALL')), -1, SUBSTR(DATA1, 1, instr(DATA1, 'ALL') - 1), DATA1), DATA1), DATA1), DATA1) PUT_THEM_ALL_TOGETHER    
FROM    table_x;    

which produces the data set below:

DATA1                   | GET_DATA_TILL_FIRST_PLE | GET_DATA_TILL_FIRST_#_SIGN  | GET_DATA_TILL_FIRST_ALL    |  PUT_THEM_ALL_TOGETHER
----------------------- | ----------------------- | --------------------------- | -------------------------- |  ----------------------
STRING_EXAMPLE          | STRING_EXAM             |                             |                            |  STRING_EXAM
TREE_OF_APPLES          | TREE_OF_AP              |                             |                            |  TREE_OF_AP
FIRST_EXAMPLE           | FIRST_EXAM              |                             |                            |  FIRST_EXAM
IMPLEMENTATION          | IM                      |                             |                            |  IM
PARIS                   |                         |                             |                            |  PARIS
PLEONASM                |                         |                             |                            |  PLEONASM
XXXX 1                  |                         |                             |                            |  XXXX 1 
XXXX YYYYYY 2 FFFFFFFFF |                         |                             |                            |  XXXX YYYYYY 2 FFFFFFFFF
XXXX YYYYYY 5FFFFFFFFF  |                         |                             |                            |  XXXX YYYYYY 5FFFFFFFFF
OPOPOPOPO #09090 APPLE  | OPOPOPOPO #09090 AP     | OPOPOPOPO                   | OPOPOPOPO #                |  OPOPOPOPO #09090 AP
OPOPOPOPO BALL#         |                         | OPOPOPOPO BALL              | OPOPOPOPO B                |  OPOPOPOPO BALL
BALL IS #LIFE           |                         | BALL IS                     | B                          |  BALL IS     

PS: I only need the PUT_THEM_ALL_TOGETHER column, but I included the other columns to add context.

I find the query a bit confusing and hard to read, so I tried using REGEXP_SUBSTR and, with the suggestion of @vkp, came up with the query below, which produces the same data set as above.

SELECT  DATA1
  , REGEXP_SUBSTR(DATA1, '(.+?)PLE',1,1,null,1) GET_DATA_TILL_FIRST_PLE
  , REGEXP_SUBSTR(DATA1, '(.+?)#',1,1,null,1) GET_DATA_TILL_FIRST_#_SIGN
  , REGEXP_SUBSTR(DATA1, '(.+?)ALL',1,1,null,1) GET_DATA_TILL_FIRST_ALL
  , COALESCE(REGEXP_SUBSTR(DATA1, '(.+?)PLE',1,1,null,1),
             REGEXP_SUBSTR(DATA1, '(.+?)#',1,1,null,1),
             REGEXP_SUBSTR(DATA1, '(.+?)ALL',1,1,null,1),
             DATA1) PUT_THEM_ALL_TOGETHER
FROM    table_x;     

However, from @MathGuy's answer, it seems that INSTR and SUBSTR are much more efficient. I tested this to some extent, and here's what I got:

Using INSTR and SUBSTR:

SET TIMING ON;    
BEGIN
    UPDATE  table_x
    SET     DATA2 = NVL(DECODE(SIGN(0 - instr(DATA1, 'PLE')), -1, SUBSTR(DATA1, 1, instr(DATA1, 'PLE') - 1), 0,
                    DECODE(SIGN(0 - instr(DATA1, '#')), -1, SUBSTR(DATA1, 1, instr(DATA1, '#') - 1), 0,
                    DECODE(SIGN(0 - instr(DATA1, 'ALL')), -1, SUBSTR(DATA1, 1, instr(DATA1, 'ALL') - 1), DATA1), DATA1), DATA1), DATA1);    
    ROLLBACK;        
END;
/            

PL/SQL procedure successfully completed.
Elapsed: 00:00:00.234

Using REGEXP_SUBSTR:

SET TIMING ON;  
BEGIN    
    UPDATE  table_x
    SET     DATA2 = COALESCE(REGEXP_SUBSTR(DATA1, '(.+?)PLE',1,1,null,1)
                            ,REGEXP_SUBSTR(DATA1, '(.+?)#',1,1,null,1)
                            ,REGEXP_SUBSTR(DATA1, '(.+?)ALL',1,1,null,1)
                            ,DATA1);
    ROLLBACK;        
END;
/    

PL/SQL procedure successfully completed.
Elapsed: 00:00:00.236

Although this is very limited test data, it shows that the combination of INSTR and SUBSTR is a little faster than REGEXP_SUBSTR. Would the difference be negligible enough to justify using REGEXP_SUBSTR instead of INSTR and SUBSTR for readability's sake?

DDL and DML:

create table table_x 
(
    data1 varchar2(100)    
   ,data2 varchar2(100)
);

INSERT INTO table_x (DATA1) VALUES ('STRING_EXAMPLE');
INSERT INTO table_x (DATA1) VALUES ('TREE_OF_APPLES');
INSERT INTO table_x (DATA1) VALUES ('FIRST_EXAMPLE');  
INSERT INTO table_x (DATA1) VALUES ('IMPLEMENTATION');   
INSERT INTO table_x (DATA1) VALUES ('PARIS');            
INSERT INTO table_x (DATA1) VALUES ('PLEONASM');        

INSERT INTO table_x (DATA1) VALUES ('XXXX 1');   
INSERT INTO table_x (DATA1) VALUES ('XXXX YYYYYY 2 FFFFFFFFF'); 
INSERT INTO table_x (DATA1) VALUES ('XXXX YYYYYY 5FFFFFFFFF'); 

INSERT INTO table_x (DATA1) VALUES ('OPOPOPOPO #09090 APPLE'); 
INSERT INTO table_x (DATA1) VALUES ('OPOPOPOPO BALL#'); 
INSERT INTO table_x (DATA1) VALUES ('BALL IS #LIFE');   

Thank you.

Migs Isip
  • `Elapsed: 00:00:00.236` - this is 236 milliseconds vs 234 milliseconds. It seems that you tested it on a veeeery limited data set (less than a few thousand records). Such a test is unreliable - for a DBMS, a few thousand records is nothing, and this difference may well come from network latency between the DBMS and your client. Please prepare a data set with **a few million rows**, not a few hundred or thousand, repeat your test, and measure again. I bet that regexp will lose. – krokodilko Dec 15 '16 at 06:13
  • hi @krokodilko, yes, that is acknowledged in the post above. – Migs Isip Dec 15 '16 at 07:07
  • @krokodilko - that makes perfect sense, but "what to do now" depends also on the OP's real-life problem. If in his real-life problem he has only 5,000 rows in the table, he may not care which solution "would be best" on a table with 10 million rows. If he had a table with 5,000 rows, he would care MUCH more about readability and maintainability than about performance. –  Dec 15 '16 at 13:01
  • @krokodilko - with that said, thank you for making me curious enough to run some tests. See the separate Answer I just posted for the results. –  Dec 15 '16 at 14:22

5 Answers

6

I already posted an answer showing how to solve this problem using INSTR and SUBSTR the right way.

In this "Answer" I address the other question - which solution is more efficient. I will explain the test below, but here is the bottom line: the REGEXP solution takes 40 times longer than the INSTR/SUBSTR solution.

Setup: I created a table with 1.5 million random strings (all exactly eight characters long, all upper-case letters). Then I modified 10% of the strings to add the substring 'PLE', another 10% to add a '#' and another 10% to add 'ALL'. I did this by splitting an original string at position mod(rownum, 9) - that is a number between 0 and 8 - and concatenating 'PLE' or '#' or 'ALL' at that position. Granted, not the most efficient or elegant way to get the kind of test data we needed, but that is irrelevant - the point is just to create the test data and use it in our tests.

So: we now have a table with just one column, data1, with some random strings in 1.5 million rows. 10% each have the substring PLE or # or ALL in them.

The test consists of creating the new string data2 as in the original post. I am not inserting the result back into the table; regardless of how data2 is calculated, the time to insert it back would be the same.

Instead, I put the main query inside an outer one that computes the sum of the lengths of the resulting data2 values. This way I guarantee the optimizer can't take shortcuts: all data2 values must be generated, their lengths must be measured, and then summed together.

Below are the statements needed to create the base table, which I called table_z, then the queries I ran.

create table table_z as
select dbms_random.string('U', 8) as data1 from dual
connect by level <= 1500000;

update table_z 
set data1 = case
when rownum between      1 and 150000 then substr(data1, 1, mod(rownum, 9)) 
                               || 'PLE' || substr(data1, mod(rownum, 9) + 1)
when rownum between 150001 and 300000 then substr(data1, 1, mod(rownum, 9)) 
                               || '#'   || substr(data1, mod(rownum, 9) + 1)
when rownum between 300001 and 450000 then substr(data1, 1, mod(rownum, 9)) 
                               || 'ALL' || substr(data1, mod(rownum, 9) + 1)
          end
where rownum <= 450000;

commit;
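
As a quick sanity check on the generated data (my addition, not part of the original test): the '#' count below should come out at exactly 150,000, while the 'PLE' and 'ALL' counts should be slightly above 150,000, since the random upper-case strings can also contain those two substrings by chance.

select count(case when instr(data1, 'PLE') > 0 then 1 end) as with_ple,
       count(case when instr(data1, '#')   > 0 then 1 end) as with_hash,
       count(case when instr(data1, 'ALL') > 0 then 1 end) as with_all,
       count(*)                                            as total
from   table_z;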

INSTR/SUBSTR solution

select sum(length(data2))
from (
select data1, 
       case 
         when instr(data1, 'PLE', 2) > 0 then substr(data1, 1, instr(data1, 'PLE', 2) - 1)
         when instr(data1, '#'  , 2) > 0 then substr(data1, 1, instr(data1, '#'  , 2) - 1)
         when instr(data1, 'ALL', 2) > 0 then substr(data1, 1, instr(data1, 'ALL', 2) - 1)
         else data1 end
       as data2
from   table_z
);

SUM(LENGTH(DATA2))
------------------
          10713352

1 row selected.

Elapsed: 00:00:00.73

REGEXP solution

select sum(length(data2))
from (
select data1, 
       COALESCE(REGEXP_SUBSTR(DATA1, '(.+?)PLE',1,1,null,1)
                            ,REGEXP_SUBSTR(DATA1, '(.+?)#',1,1,null,1)
                            ,REGEXP_SUBSTR(DATA1, '(.+?)ALL',1,1,null,1)
                            ,DATA1)
       as data2
from   table_z
);

SUM(LENGTH(DATA2))
------------------
          10713352

1 row selected.

Elapsed: 00:00:30.75

Before anyone suggests these things: I repeated both queries several times; the first solution always runs in 0.75 to 0.80 seconds, the second query runs in 30 to 35 seconds. More than 40 times slower. (So it is not a matter of the compiler/optimizer spending time to compile the query; it is really the execution time.) Also, this has nothing to do with reading the 1.5 million values from the base table - that is the same in both tests, and it takes far less time than the processing. In any case, I ran the INSTR/SUBSTR query first, so if there was any caching, the REGEXP query would have been the one to benefit.

Edit: I just figured out one inefficiency in the proposed REGEXP solution. If we anchor the search pattern to the beginning of the string (for example '^(.+?)PLE' - notice the ^ anchor), the runtime for the REGEXP query drops from 30 seconds to 10 seconds. Apparently the Oracle implementation isn't smart enough to recognize this equivalence, and it tries searches from the second character, the third, and so on. The execution time is still almost 15 times longer; 15 is less than 40, but that is still a very large difference.
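
For concreteness, the anchored variant is the same test query with only the patterns changed (a sketch based on the REGEXP query above):

select sum(length(data2))
from (
select data1,
       coalesce(regexp_substr(data1, '^(.+?)PLE', 1, 1, null, 1)  -- note the ^ anchor
               ,regexp_substr(data1, '^(.+?)#',   1, 1, null, 1)
               ,regexp_substr(data1, '^(.+?)ALL', 1, 1, null, 1)
               ,data1)
       as data2
from   table_z
);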

  • I think you are in verbose mode today, e.g. `@mathguy --verbose` – Tim Biegeleisen Dec 15 '16 at 15:05
  • @TimBiegeleisen - :-) I often am. I am generally known to say too much rather than not enough. –  Dec 15 '16 at 15:08
  • "Apparently the Oracle implementation isn't smart enough to recognize this equivalence" as much as I love a good dig at Oracle, in their defence `(.+?)PLE` is not equivalent to `^(.+?)PLE`. `^` is the beginning of the line so defines a stricter subset of the other pattern. RegEx is powerful but also needs its own compilation to execute and will always be slower than a basic `like` or `instr` literal character matching function. It's also always going to be faster if you can match from the first character, because it rules out mismatches faster. More so if there's an index on that string. – Davos Aug 22 '19 at 06:30
  • @Davos - The leading `(.+?)` makes both your observations wrong. If there is a match starting at the third character, then the leading `(.+?)` part of that match can be extended to include the first two characters of the string, meaning there was a match also from the first character. There can't be a match from the third character if there isn't one from the first also. And your comment about the index... how can an index help if the first part of the reg exp is `(.+?)`? "Let the sleeping dogs lie", as they say. –  Aug 22 '19 at 12:04
  • I see your point, they are equivalent in this case. If the pattern allowed for newline characters (which I realise it doesn't) then the `^` would exclude any matches after the newline which would not be equivalent; perhaps this is an optimisation they could include, if it can be determined that the rest of the pattern makes no allowance for newlines. Regarding indexes I was making a general point that if you can string match from the beginning without leading wildcards, then indexes can be used, most regex pattern matching will preclude index seeks. Yes, I wish I had. – Davos Aug 22 '19 at 13:20
2

In practice, both the approach using INSTR and the one using REGEXP_SUBSTR perform very similar string manipulations.

Let's compare the second select term in the query using base string functions with the corresponding term in the one using a regular expression. Here is that term from your first query, which uses just INSTR:

DECODE(SIGN(0 - instr(DATA1, 'PLE')), -1, SUBSTR(DATA1, 1, instr(DATA1, 'PLE') - 1))

Assuming that the DATA1 column matches, this would require two calls to INSTR. Without knowing the implementation details, I would assume that each call to INSTR makes a single left-to-right pass along the string in DATA1 until 'PLE' is found or the end of the string is reached. In either case, INSTR requires only a single pass across the string in DATA1.
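
As an aside, the DECODE(SIGN(0 - INSTR(...)), -1, ...) construction is just a roundabout "found" test: SIGN(0 - n) is -1 exactly when n > 0, that is, when INSTR found the substring, and 0 when it did not. A minimal illustration (my sketch, not from the original post):

select data1,
       instr(data1, 'PLE')           as pos,        -- 0 when 'PLE' is not found
       sign(0 - instr(data1, 'PLE')) as sign_trick  -- -1 when found, 0 otherwise
from   table_x;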

Here is the same select term but using a regular expression:

REGEXP_SUBSTR(DATA1, '(.+?)PLE',1,1,null,1)

I am not too familiar with all the parameters you used in your call to REGEXP_SUBSTR, but from what I can see this is simply doing a vanilla-flavored regex capture of everything in the DATA1 column up until the first occurrence of 'PLE'. Again, this would require only a single pass through the DATA1 string, and there is nothing in your regex, such as lookaheads or lookbehinds, that would require more than a single pass.
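
For readers unfamiliar with those parameters, Oracle documents the signature (from 11g onward) as REGEXP_SUBSTR(source, pattern, position, occurrence, match_parameter, subexpression). Annotated, the call reads roughly like this (my sketch, not part of the original answer):

regexp_substr(data1,      -- source string
              '(.+?)PLE', -- pattern: lazily capture everything before 'PLE'
              1,          -- position: start the search at character 1
              1,          -- occurrence: take the first match
              null,       -- match_parameter: no modifiers (e.g. 'i' for case-insensitivity)
              1)          -- subexpression: return capture group 1, not the whole match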

All of your selects look pretty similar to this, so I would assume that the performance of using base Oracle string functions as compared to regex would be fairly similar, and your own performance tests seem to confirm this, at least for your particular data set.

In practice, very similar string manipulations are being done; they are just phrased a bit differently at a high level. Personally, I might opt for the regex solution because it is more readable and perhaps easier to maintain.

Tim Biegeleisen
  • Nice speculation. Alas, actual testing shows the conclusion to be 100% wrong. On a table with 1.5 million `DATA1` values, of which 10% each contain the substrings `'PLE'`, `'#'` and `'ALL'`, the `REGEXP` solution takes 40 times longer than the `INSTR/SUBSTR` solution. (I posted a separate Answer describing the tests, which can be replicated by anyone who cares to.) The problem with the speculation you offered is that we don't know the details of how Oracle implements the different functions. I have no idea why `INSTR/SUBSTR` is so much faster. It just is. –  Dec 15 '16 at 14:19
  • As to readability, I am not convinced. I show in another answer how I would write the solution with `INSTR/SUBSTR`. To my eyes, it seems clearer - I can tell right away what it is supposed to do, and how it does it. But that is probably subjective - if one works with regular expressions all the time, they may find the `REGEXP` solution easier to read. –  Dec 15 '16 at 14:21
  • @mathguy Your comments are speculative because you didn't study how running time scales with the size of the data set, types of operations, etc. There may be an implicit overhead to doing _any_ regex in Oracle, which could explain what you are seeing. I believe my general analysis of the algorithms used in both cases to be sound. – Tim Biegeleisen Dec 15 '16 at 14:31
  • The question wasn't about algorithms (I don't know what algorithms Oracle uses, do you?) but about execution times. If your analysis was approximately right, then the result shouldn't depend on the size of the data - and certainly not to a factor of 40! My point was exactly what you concluded too: there IS a SIGNIFICANT overhead to doing *any* regex in Oracle (which is what I have read repeatedly before, and the reason I told the OP to stick with `INSTR/SUBSTR` in his original post, related to this one). –  Dec 15 '16 at 14:48
  • @mathguy But there is something to be said for having a query which is readable and maintainable. Regex can be much easier to read and maintain than base string functions. – Tim Biegeleisen Dec 15 '16 at 15:06
  • Totally agreed. I always say "use INSTR and SUBSTR, unless you can't write something that's easy to read and maintain. If you can't then use regexp even if performance is worse." In this thread - the query may take 30 seconds instead of less than a second, but if it takes even a few minutes longer to read and understand the query (and maintain it), still use the regexp solution. I just contend that in this case, the INSTR/SUBSTR solution I posted is just as easy to read as the REGEXP solution. (But it depends on the reader's experience with either type of functions.) –  Dec 15 '16 at 15:11
  • I just added a brief Edit to my reply regarding performance comparisons - I figured one way to make the REGEXP solution more efficient. See details there. Still almost 15 times slower (rather than 40 times slower). –  Dec 15 '16 at 20:20
1

The solution using INSTR and SUBSTR can be rewritten so it is just as easy to read and maintain as the regexp-based solution, or easier. Regarding performance, you can test both versions and decide. NOTE: this solution will be faster than your attempt with INSTR and SUBSTR, because in a CASE expression the branches are evaluated only until the first true condition; the remaining branches are not evaluated. (And also because your solution has a very large number of unnecessary function calls - NVL, DECODE, SIGN - none of which are needed.)

select data1, 
       case when instr(data1, 'PLE') > 0 then substr(data1, 1, instr(data1, 'PLE') - 1)
            when instr(data1, '#')   > 0 then substr(data1, 1, instr(data1, '#'  ) - 1)
            when instr(data1, 'ALL') > 0 then substr(data1, 1, instr(data1, 'ALL') - 1)
            else data1 end
       as data2
from   table_x;

Adapt as appropriate for your UPDATE statement.
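
For example, the corresponding UPDATE might look like this (my sketch of the adaptation, untested):

update table_x
set    data2 = case when instr(data1, 'PLE') > 0 then substr(data1, 1, instr(data1, 'PLE') - 1)
                    when instr(data1, '#')   > 0 then substr(data1, 1, instr(data1, '#'  ) - 1)
                    when instr(data1, 'ALL') > 0 then substr(data1, 1, instr(data1, 'ALL') - 1)
                    else data1 end;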

Edit: I missed the fact that if the initial result is NULL, the expression should return the original string. This means that we should NOT search for 'PLE', '#' or 'ALL' at the very beginning of the string. However, it is possible that the search pattern is found at the beginning of the input string and then again later in the string - for example, 'ALL IN ALL'. So we need to start the INSTR search at the second character of data1 instead of the first. We use the third (optional) argument of INSTR for that.

Updated solution:

select data1, 
       case 
          when instr(data1, 'PLE', 2) > 0 then substr(data1, 1, instr(data1, 'PLE', 2) - 1)
          when instr(data1, '#'  , 2) > 0 then substr(data1, 1, instr(data1, '#'  , 2) - 1)
          when instr(data1, 'ALL', 2) > 0 then substr(data1, 1, instr(data1, 'ALL', 2) - 1)
          else data1 end
       as data2
from   table_x;
  • I'm not getting the result of `PLEONASM` in the data2 column, but this looks good. Also, the `APP` should be `ALL`, but both have the same result. – Migs Isip Dec 15 '16 at 05:47
  • @MigsIsip - I edited to change 'APP' to 'ALL'. Regarding `PLEONASM`, I missed, all along, that you do not want to return `NULL` when you find `PLE` (or `#` or `ALL`) right at the beginning of the string (although I am not sure why not; that would make more sense). Anyway, that is very easy to accommodate - I added an **Edit** at the end of the answer. –  Dec 15 '16 at 13:27
1

Have you measured whether you actually have a performance problem? What size data sets are you looking at? What sort of speedup are you hoping to achieve? If you are looking for, say, a 10% speedup in every query, then a difference between 234ms and 236ms is not going to matter, and you're wasting your time as a programmer.

Is your SQL query taking a significant share of your code's running time? If your app is 80% network transfer and 20% server side, and you manage to speed up your query by 10%, then you've only sped up your app by 2% (0.20 × 0.10 = 0.02). It's better to spend your time reducing the time it takes to get data across the network.

Another way to think about it: imagine you find that in real life you're spending more money than you make each month. You start thinking, "Well, maybe if I get a small coffee at Starbucks instead of a large, I'll save 50 cents!" - but what about the $3,000 mortgage that eats up 60% of your take-home pay?

If you don't know what is slow in your app, you could be wasting your time on the wrong thing.

Andy Lester
-1

In defense of any RegEx...

It is usually a feature-rich, "catch-all" function that returns more information than something like "InStr()" and "SubStr()" can ever offer.

If you programmed something CLOSE to what RegEx offers, the VB6/VBA counterpart would be severely slower on all accounts.

However, that being said...

When you DO have a real, specific (single-use) need for a RegEx, it is usually going to be faster to hard-code your own text processing.

What would be even faster: instead of bending to bad data formats and trying to dance around them, change the format to one that better suits your actual needs. (In this case, one that makes the record data easier to find and isolate.)

One perfect example of this is the RTF format. You will see things like "\sa430\sl700\slmult1". That is a horrible setup if you have to isolate the number from the text indicator. A better format, often found in HTML, INI files, and just about everywhere else, is the following: "\sa=430\sl=700\slmult=1". Now the data is "better" for the needs of finding and isolating things. A simple RegEx - "\", then anything, then "=", then [0-9]* - is fast and easy. However, InStr("\") followed by InStr("=") might be a bit faster.
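
To tie that back to this thread's Oracle context: isolating the number from the friendlier format is a one-liner with REGEXP_SUBSTR (my hypothetical example, not the answerer's):

select regexp_substr('\sa=430\sl=700\slmult=1', 'sa=([0-9]+)', 1, 1, null, 1) as sa_value
from   dual;   -- returns '430'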

What would be even faster is this...

MyArray = Split(MyData, "\")

resulting in: "sa=430", "sl=700", "slmult=1", "something", "somethingElse=bob can fly", "qj".

Now, for each index in the array, a simple InStr("=") is needed, but only if the LENGTH of the array string is > 3; in this specific case, the smallest possible set is "XX=#". But you could skip the Len(MyArray(i)) check, since the majority of values in practice are longer than 3. (This is something you would refine when hard-coding, while with RegEx it's already done by the way it matched.)

You wouldn't normally "think" to use a function like "Split()" in a method like this, but that ONE function is faster than "InStr()", simply because it doesn't track the "position" of the "find"; it just jumps past it and dumps the prior data into the next cell of the growing MyArray() array. Additionally, all the data is already pre-isolated, which is something InStr and SubStr don't even offer. (InStr gives you the position, and RegEx returns a collection of the "finds".)

It is always interesting to see some of these results in isolation. RegEx really wasn't intended for blinding speed on massive data sets with complex comparisons. Complex comparisons were the first priority, massive data sets the second, and only now, after years of use and refinement, has speed just about reached its practical limit. It is what it is, for better or worse; there really is no other function quite like it, as a whole.

Though, since it is as refined as it is, it is now OUR job to shape the data to be as efficient and functional as the programs that have to deal with it. Sometimes that little bit of extra bloat is worth the gains down the line, just as super-refinement into micro-formats ultimately makes everyone's jobs harder for no real end-game gains. There are tools for compression; we no longer need to manually shrink this stuff down until even a program has a hard time understanding what it actually is.

This is why we save telephone numbers as 9-999-999-9999 instead of reducing that to 9999999999, and then reducing that to something silly like a 5-byte value "XXXXX" that no one can read - just to save 50% on 1% of a data set, reducing it by 0.5% - when ZIP, GZIP, RAR or ARJ would have compressed the raw string "9-999-999-9999" better, even though it is larger. (Not saving that redundant "-" is the first step in the process of data formatting. You can't even enter a "-" on a phone to dial a number. Maybe one dash, to mark the country code and isolate it from the number: "9-9999999999", which then tells you the "-" style format and lets you know if the number is incomplete - for example, missing an area code, which is actually a critical component now in many places.)