
I have a table (Data) that includes nine long text fields. It currently has 1,000,000 records and it's growing, so I would like to speed things up a bit.

I'm using a Table-Valued RegEx function (master.dbo.RegExMatches) to parse out words from these nine fields and place them into a table (DataWordFullMap) that has the Record ID (Id in the example below), the word (wordtoadd), the column name (DE87 in the first query, DE150 in the second), and the starting character location within the column (MatchIndex).
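
For readers following along, a hypothetical version of that mapping table might look like the sketch below; the name of the column-name column and all of the data types are guesses rather than the actual schema.

-- Hypothetical DDL, for illustration only: names and types are assumptions.
CREATE TABLE DataWordFullMap
(
    Id         int             NOT NULL ,  -- record id copied from Data
    wordtoadd  nvarchar( 255 ) NOT NULL ,  -- word extracted by the regex
    columnname varchar( 10 )   NOT NULL ,  -- source column, e.g. 'DE87' or 'DE150'
    MatchIndex int             NOT NULL    -- starting character position within that column
);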

The function master.dbo.RegExMatches takes the parameters regex, column name, options. The current setup has to scan the table once for each field (each is in a separate query), rather than applying the function to each of the nine columns in one scan.

Is there an efficient way of consolidating these CROSS APPLY statements into one query? Perhaps by adding an additional column to the CROSS APPLY result that holds the name of the column used in the regex function? Some of the columns are mostly NULL, while others have no NULL values, so it seems wasteful to do a full scan for each of the nine when some of them will produce no results for most of the scan.

The following snippet shows two of the nine queries involved (a sketch of one possible consolidated form follows the snippet):

INSERT INTO DataWordFullMap
SELECT Id ,
       CAST ( Match AS nvarchar( 255 ))  AS wordtoadd ,
       'DE87' ,
       MatchIndex
  FROM
       Data CROSS APPLY master.dbo.RegExMatches( '[\w-[0-9ÿ_]]{2,}(-[\w-[0-9ÿ_]]{2,})?(''[\w-[0-9ÿ_]])?' , DE87 , master.dbo.RegExOptionEnumeration( 0 , 0 , 1 , 1 , 0 , 0 , 0 , 0 , 0 ));

INSERT INTO DataWordFullMap
SELECT Id ,
       CAST ( Match AS nvarchar( 255 ))  AS wordtoadd ,
       'DE150' ,
       MatchIndex
  FROM
       Data CROSS APPLY master.dbo.RegExMatches( '[\w-[0-9ÿ_]]{2,}(-[\w-[0-9ÿ_]]{2,})?(''[\w-[0-9ÿ_]])?' , DE150 , master.dbo.RegExOptionEnumeration( 0 , 0 , 1 , 1 , 0 , 0 , 0 , 0 , 0 ));
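
For what it's worth, here is a minimal sketch (not part of the original post) of one pattern the question hints at: unpivot the wide columns with CROSS APPLY (VALUES ...) so each row carries the column name alongside the value, skip NULL values, and apply master.dbo.RegExMatches once to the unpivoted value. Only the two columns shown above are listed; the v(cn, val) alias and the NULL filter are illustrative assumptions.

-- Sketch only: extend the VALUES list to cover all nine columns.
INSERT INTO DataWordFullMap
SELECT d.Id ,
       CAST ( rem.Match AS nvarchar( 255 ))  AS wordtoadd ,
       v.cn ,
       rem.MatchIndex
  FROM
       Data AS d
       CROSS APPLY ( VALUES ( 'DE87'  , d.DE87 ),
                            ( 'DE150' , d.DE150 ) ) AS v( cn , val )
       CROSS APPLY master.dbo.RegExMatches( '[\w-[0-9ÿ_]]{2,}(-[\w-[0-9ÿ_]]{2,})?(''[\w-[0-9ÿ_]])?' , v.val , master.dbo.RegExOptionEnumeration( 0 , 0 , 1 , 1 , 0 , 0 , 0 , 0 , 0 )) AS rem
 WHERE v.val IS NOT NULL;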

1 Answer

INSERT INTO datawordfullmap(
    -- ...
)
SELECT
    d.id,
    rems.wordtoadd,
    rems.cn,
    rems.matchindex
FROM
    data AS d
    CROSS APPLY (
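        -- One SELECT per source column, combined with UNION ALL;
        -- cn records which column each match came from.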
        SELECT 
            cn='DE87',
            wordtoadd=CAST(rem.match AS NVARCHAR(255)),
            rem.matchindex
        FROM
            master.dbo.RegExMatches (
                '[\w-[0-9ÿ_]]{2,}(-[\w-[0-9ÿ_]]{2,})?(''[\w-[0-9ÿ_]])?',
                d.DE87,
                master.dbo.RegExOptionEnumeration( 0 , 0 , 1 , 1 , 0 , 0 , 0 , 0 , 0 )
            ) AS rem
        UNION ALL
        SELECT 
            cn='DE150',
            wordtoadd=CAST(rem.match AS NVARCHAR(255)),
            rem.matchindex
        FROM
            master.dbo.RegExMatches (
                '[\w-[0-9ÿ_]]{2,}(-[\w-[0-9ÿ_]]{2,})?(''[\w-[0-9ÿ_]])?',
                d.DE150,
                master.dbo.RegExOptionEnumeration( 0 , 0 , 1 , 1 , 0 , 0 , 0 , 0 , 0 )
            ) AS rem
        -- UNION ALL
        -- ...
    ) AS rems;
  • Thanks! That works. _However_, I got quite a surprise when I compared the original and the edited version. The original took 107 seconds, and the consolidated version took 1046 seconds. When I looked at the server's performance monitor, the disk system seemed to be taking the greatest hit (it's using an SSD) with writes topping out at 110 MB/sec, out of which 100 MB/sec was for the database log file. Memory did not seem to be a problem. The database is set for simple recovery mode logging. – rsjaffe Feb 11 '16 at 17:13
  • @RoryJaffe I'm wondering what the execution plan is like for the consolidated version. I haven't installed this `master.dbo.RegExOptionEnumeration` and I have no sample data, so I can't really say what the reason is. – TT. Feb 11 '16 at 18:09
  • To develop the execution plan, I decided to set up a copy of the DataWordFullMap table so as to not overwrite the good data. I ran the edited version as insert into the new table, and the query took 108 seconds--only one second difference between the consolidated and the original query. I'm at a loss as to the difference. I did truncate DataWordFullMap in the earlier tests prior to running the query, so I would think that the two situations are similar, but it looks like it is not. The execution plan is the same as it was when I first put together the consolidated query for the earlier test. – rsjaffe Feb 11 '16 at 19:20
  • Here is a description of the actual plan for the consolidated query: The consolidated query has 9 sets of TVF→Compute Scalar. In parallel to that is a clustered index scan of the Data table. Those are joined in Nested Loops, then there is a Compute Scalar followed by the Table Insert. Then a sort (there is an index) followed by the Index Insert. The plan shows that most of the time was spent in the sort (cost 66%) – rsjaffe Feb 11 '16 at 19:24
  • @RoryJaffe Well the [actual execution plan](http://stackoverflow.com/a/7359705/243373) on the original table should provide some clues. It takes a bit of practice to analyze these and act on the indications it gives. Also, even in a perfect setting I think the consolidated version won't run faster than the individual statements... the amount of work is basically the same. Doesn't matter much if the work is done in individual statements. – TT. Feb 11 '16 at 19:26
  • Thank you. I think I learned my lesson about consolidating queries. – rsjaffe Feb 11 '16 at 19:45