2

I am trying to populate our "short description" field using the "full description" field. Basically, I want to set the ShortDescription column equal to the first three sentences from the FullDescription column.

I know how I would do this in C#, but I am having a little trouble getting it done in my SQL query. I don't care about performance- as this query will only be ran one time to generate this temporary data. So, any and all solutions will work just find for us!

My attempt:

UPDATE Product
    SET ShortDescription = (
        CASE
            WHEN (LEN(FullDescription) - LEN(REPLACE(FullDescription, '.', ''))) >= 3 THEN
                (
                    SELECT
                        LEFT(str, pos)
                    FROM (
                        SELECT
                            FullDescription AS str,
                            CHARINDEX('.', FullDescription) AS pos
                    ) x
                )
            ELSE
                FullDescription
        END
    )
WHERE FullDescription IS NOT NULL;

Unfortunately, the above query only gets the first sentence. I can't seem to figure out how to find the CHARINDEX of the third period. Anyone know a clean and easy way to locate this character?

Also, am I correct in my assumption that the periods are really the only way to identify sentences? I am concerned that (in some rare occasions), that there may be decimals in the sentences, which would provide some awful descriptions like: "This product is great. It has awesome features. It is 2."...

Any direction or feedback is greatly appreciated! Thank you!

Derek Foulk
  • 1,892
  • 1
  • 19
  • 37
  • 1
    If you know how to do this in C#, then why aren't you? – Joseph Young Oct 05 '16 at 19:04
  • What version of sql-server are you using? – smj Oct 05 '16 at 19:14
  • If you do not care about performance, then use CURSOR. – DVT Oct 05 '16 at 19:14
  • @smj - 2014 and 2008 R2 – Derek Foulk Oct 05 '16 at 19:14
  • @JosephYoung - Because I would like to learn how to do this in T-SQL... – Derek Foulk Oct 05 '16 at 19:15
  • @DVT - I will check into that- thank you! – Derek Foulk Oct 05 '16 at 19:15
  • @derekmx271 take a look at this question. Solution could involve splitting the FullDescription into sentences on the period character. To your point about decimals, using this method you may also want to concatenate sentences that don't start with a space. http://stackoverflow.com/questions/10581772/how-to-split-a-comma-separated-value-to-columns – smj Oct 05 '16 at 19:18
  • Shame you don't have 2016 :), string_split is available https://msdn.microsoft.com/en-nz/library/mt684588.aspx – smj Oct 05 '16 at 19:19
  • @DVT how is a cursor going to help here? I realize they aren't concerned about performance but just saying "use a cursor" doesn't provide any information. – Sean Lange Oct 05 '16 at 19:19
  • @smj - I like it. I seems like a really dirty way to get this done, but hey- don't care as long as it works lol. – Derek Foulk Oct 05 '16 at 19:21
  • @smj - Yeah, we aren't quite ready to upgrade yet :) – Derek Foulk Oct 05 '16 at 19:22
  • How wide is the `ShortDescription` column? What if the first 3 sentences is longer than that column? – Solomon Rutzky Oct 05 '16 at 19:37
  • There are far better things you can learn to do than parsing out sentences in SQL, for crying out loud. – ErikE Oct 05 '16 at 21:50
  • @ErikE - Thank you for your opinion, but I did not just decide to learn how to do this for the heck of it. I have my reasons for wanting to know how to do this. – Derek Foulk Oct 11 '16 at 19:08
  • @derekmx271 I'm sure you do! I just wonder if you're having an A-B problem (you ask for A, but you really want B, only you don't know enough to realize this). – ErikE Oct 11 '16 at 19:09

6 Answers6

3

You could use a true regular expression, which makes this task much easier. Regular Expressions are not natively supported in T-SQL, but you can use SQLCLR to access them via .NET, in which case they can be tied directly into the UPDATE statement. For example:

DECLARE @Pattern NVARCHAR(4000) = 
                           N'((?:Mr\.|Ms\.|Mrs\.|Sr\.|Jr\.|Dr\.|.)+?[.!?](?:\s+|$)){1,3}';

SELECT SQL#.RegEx_MatchSimple4k(tmp.[txt], @Pattern, 1, NULL)
FROM   (VALUES (N'Sentence uno!  Two, I think.  Only 2.3 till 3:12 A.M. Numero 4. Y five.'),
               (N'First one?      Second one.'),
               (N'Hello, this is Dr. Zhivago. Nice to meet you! I''m Mr. Mister. Really?')
       ) tmp(txt);

Returns:

Sentence uno!  Two, I think.  Only 2.3 till 3:12 A.M. 
First one?      Second one.
Hello, this is Dr. Zhivago. Nice to meet you! I'm Mr. Mister. 

Some notes:

  • You can get SQLCLR RegEx objects from several places. One source of pre-compiled RegEx functions is the SQL# SQLCLR library that I created. The free version contains most of the RegEx functions, including the one used in the example above.

  • The Regular Expression:

    • looks for "sentences" that end in any of the following characters: ., ?, and !.
    • a "sentence", in this context, is: the fewest number of occurrences of "Mr.", "Ms.", "Mrs.", "Sr.", "Jr.", "Dr.", or any other character, prior to finding a period, exclamation mark, or question mark.
    • can handle the string containing fewer than 3 "sentences" (as shown in the example)
    • can handle periods used as decimal places or in acronyms (as shown in the example)
  • I cannot think of any programmatic way to know that abbreviations of titles (e.g. Mr., Ms., Mrs., Dr., etc) in general are not the end of a sentence, outside of having a list of them to check against. I provided a short list in the pattern shown above, which can extended rather easily. But those are somewhat easy as they never end a sentence. You can also have abbreviations of units of measurement (e.g. lbs.) that can be either in the middle of a sentence or at the end.

Solomon Rutzky
  • 46,688
  • 9
  • 128
  • 171
  • @derekmx271 No problem. Glad it works :). I had forgotten until just now that I was going to try to update the Regular Expression to have a list of common title abbreviations -- Mr., Mrs., Ms., Dr., etc -- to reduce the number of edge-cases. Is that still a desire? Not sure how easy/hard it is to do, though I believe it is possible. But if you don't need/want it, then I won't worry about it. – Solomon Rutzky Oct 11 '16 at 19:13
  • @derekmx271 I realize that you switched the accepted answer, which is fine, but I still figured out how to handle title abbreviations and updated my answer with the improved RegEx pattern. – Solomon Rutzky Oct 11 '16 at 19:58
  • Yeah, both of you guys had answers that worked great for me. I had to pick one or the other- and I did so only because John Cappelletti's answer does not require 3rd party libraries. Some folks (like me) are not always able to utilize these libs. If I could mark both answers- I would have! – Derek Foulk Oct 11 '16 at 20:15
  • Your improved pattern helped with some of the one-off values. Thank you for your solution and your awesome library! – Derek Foulk Oct 11 '16 at 20:16
  • @derekmx271 Yer welcome. And I appreciate the sentiment of wanting to pick both, and I completely understand that between two otherwise similarly good options, you go with the one that has less dependencies. It was a good opportunity for me to try something new with RegEx. And if something _can_ be done in T-SQL, then go with that. I didn't think it was possible without TONS of code. Good solution by John. Glad you like [SQL#](http://SQLsharp.com/) :-). – Solomon Rutzky Oct 11 '16 at 20:56
2

I have a TVF which may help. If you don't want the UDF, the code can easily be ported into the Cross Apply.

I should note. This delimiter is a period followed by a space. Just now thinking it would not capture other punctuation (i,e. !?)

Declare @String varchar(max) ='This is sentance one. This is sentance two.  This is Sentence 3.  This is sentecne 4.'
Declare @YourTable table (ID int,FullDescription varchar(max))
Insert Into @YourTable values
(1,'Some sentance with a decimal like 25.26 is OK. Sentance number two.  Sentance number 3. Sentance number 4 would not be included.'),
(2,'I know how I would do this in C#.  I am having a little trouble getting it done in my SQL query. I don''t care about performance. This query will only be ran one time.')

Select A.*
      ,B.ShortDescription
 From @YourTable A
 Cross Apply (Select ShortDescription=concat(Pos1,'. ',Pos2,'. ',Pos3,'.') From [dbo].[udf-Str-Parse-Row](A.FullDescription,'. ')) B

Returns enter image description here

The UDF if desired

CREATE FUNCTION [dbo].[udf-Str-Parse-Row] (@String varchar(max),@Delimiter varchar(10))
Returns Table 
As
Return (
    Select Pos1 = xDim.value('/x[1]','varchar(max)')
          ,Pos2 = xDim.value('/x[2]','varchar(max)')
          ,Pos3 = xDim.value('/x[3]','varchar(max)')
          ,Pos4 = xDim.value('/x[4]','varchar(max)')
          ,Pos5 = xDim.value('/x[5]','varchar(max)')
          ,Pos6 = xDim.value('/x[6]','varchar(max)')
          ,Pos7 = xDim.value('/x[7]','varchar(max)')
          ,Pos8 = xDim.value('/x[8]','varchar(max)')
          ,Pos9 = xDim.value('/x[9]','varchar(max)')
     From (Select Cast('<x>' + Replace(@String,@Delimiter,'</x><x>')+'</x>' as XML) as xDim) A
)
--Select * from [dbo].[udf-Str-Parse-Row]('Dog,Cat,House,Car',',')
--Select * from [dbo].[udf-Str-Parse-Row]('John Cappelletti',' ')
John Cappelletti
  • 79,615
  • 7
  • 44
  • 66
2

Try this simple trick:

UPDATE Product
    SET ShortDescription = (
        CASE
            WHEN (LEN(FullDescription) - LEN(REPLACE(FullDescription, '.', ''))) >= 3 THEN
                (
                    SELECT
                        LEFT(str, pos)
                    FROM (
                        SELECT
                            FullDescription AS str,
                            CHARINDEX('.', FullDescription,
                               CHARINDEX('.', FullDescription,
                                  CHARINDEX('.', FullDescription)+1)+1) AS pos
                    ) x
                )
            ELSE
                FullDescription
        END
    )
WHERE FullDescription IS NOT NULL;

Explanation: the T-SQL CHARINDEX function accepts an optional parameter indicating the index at which to start searching. So, by nesting three of these calls, the result of each nested call will be used as the starting point for the search for the next one. The most nested call will find the first period, then we advance one character beyond that and search for the second, then again for the third.

This would not be the right answer if you wanted many more than three sentences, and it would create errors without the guard clause you already have with the CASE structure, but it should function given your basic strategy.

The basic strategy of looking for periods doesn't work well if the description is not written in conversational English; a description including a reference to source code, or an ellipsis, will break the model. You could switch to searching for ". ", but this requires the three sentences to be part of the same paragraph along with a fourth sentence. You'd need to upgrade to a Regex search matching a period followed by any whitespace character (including a newline) or by the end of the string in order for this to really work well.

KeithS
  • 70,210
  • 21
  • 112
  • 164
1

Here is the code that use CURSOR. (You can put the part that calculate the ShortDescription into a separate function to make it more readable).

DECLARE @fd NVARCHAR(500);
DECLARE @tmp NVARCHAR(500);
DECLARE @sd NVARCHAR(500);

DECLARE @count INT;
DECLARE @pos INT;

DECLARE update_cursor CURSOR FOR SELECT FullDescription FROM Product;

OPEN update_cursor;

FETCH NEXT FROM update_cursor INTO @fd;
WHILE (@@FETCH_STATUS= 0)
BEGIN
    SET @sd = '';
    IF (LEN(@fd) - LEN(REPLACE(@fd,'.','')) < 3)
    BEGIN
        SET @sd = @fd;
    END;
    ELSE
        BEGIN
            SET @tmp = @fd;
            SET @count = 1;
            SET @pos = 0;
            WHILE (@count < 4)
            BEGIN
                SET @pos = CHARINDEX('.',@tmp, @pos);
                SET @sd = CONCAT(@sd, SUBSTRING(@tmp,1,@pos));
                SET @tmp = SUBSTRING(@tmp,@pos+1,LEN(@tmp));
                SET @count = @count +1;
            END;
        END;

    UPDATE Product
    SET ShortDescription = @sd
    WHERE CURRENT OF update_cursor;

    FETCH NEXT FROM update_cursor INTO @fd;
END;

CLOSE update_cursor;
DEALLOCATE update_cursor;
DVT
  • 3,014
  • 1
  • 13
  • 19
  • This one assume that '.' is the only sentence separator, but if you already have the definition of a sentence in C#, I guess it should not be a problem putting it here. – DVT Oct 05 '16 at 22:10
  • 1
    Nice, I tried this (and made some changes for the other sentence delimiters) and this seems to work great. I appreciate your time! – Derek Foulk Oct 11 '16 at 19:15
  • Thanks. I'm glad you found it helpful. – DVT Oct 11 '16 at 19:46
0

This is so dirty, but give it a try.

SET NOCOUNT ON; 
DECLARE @t TABLE (i INT, sd NVARCHAR(MAX), ld NVARCHAR(MAX)); 
INSERT INTO @t VALUES (1, '', '1s. 2s. 3s. 4.');
INSERT INTO @t VALUES (2, '', '4s. 5s. 6s. 7.');
INSERT INTO @t VALUES (3, '', '6s. 7.');

DECLARE @sd NVARCHAR(MAX)
DECLARE @ld NVARCHAR(MAX)
DECLARE @i INT, @a INT; 


DECLARE @ss TABLE (s NVARCHAR(MAX)); 
DECLARE @s NVARCHAR(MAX); 

WHILE (SELECT COUNT(*) FROM @t) > 0
BEGIN
    SELECT TOP 1 @i = i, @sd = sd, @ld = ld FROM @t; 

    DELETE FROM @ss; 
    SET @a = 1

    WHILE LEN(@ld) > 0
    BEGIN
        IF @a > 3
            BREAK; 
        SET @s = LEFT(@ld, CHARINDEX('.', @ld));
        INSERT INTO @ss VALUES (@s); 
        SET @ld = REPLACE(@ld, @s, '');
        SET @a = @a + 1; 
    END

    WHILE (SELECT COUNT(*) FROM @ss) > 0
    BEGIN
        SELECT TOP 1 @s = s FROM @ss; 
        SET @sd = @sd + @s
        DELETE FROM @ss WHERE s = @s
    END

    PRINT @sd; 

    DELETE FROM @t WHERE i = @i; 
END
Uwe E.
  • 39
  • 3
  • Dirty is how I like my queries! (jk) I will give this a shot and report back for future peeps looking for something like this. Thank you! – Derek Foulk Oct 11 '16 at 19:16
0

The following user defined function will make use of SQL Server’s PATINDEX function to match the first occurrence of a period that is not followed by a number. This will avoid the scenario where a sentence contains a decimal value somewhere in the sentence.

CREATE FUNCTION GetFirstThreeSentences 
(
    @fullText NVARCHAR(MAX)
)
RETURNS NVARCHAR(MAX)
AS
BEGIN

    DECLARE @finalPosition  INT = 0;
    DECLARE @patternIndex INT = 0;
    DECLARE @textLenght INT = LEN(@fullText);

    -- Get first sentence.
    DECLARE @currentSentencePosition INT = PATINDEX('%.[^0-9]%', @fullText);
    SET @finalPosition = @currentSentencePosition;
    DECLARE @remainingText NVARCHAR(MAX) = RIGHT(@fullText, @textLenght - @finalPosition);

    -- Get second sentence.
    SET @currentSentencePosition  = PATINDEX('%.[^0-9]%', @remainingText);
    SET @finalPosition = @finalPosition + @currentSentencePosition;
    SET @remainingText  = RIGHT(@fullText, @textLenght - @finalPosition);

    -- Get third sentence.
    SET @currentSentencePosition  = PATINDEX('%.[^0-9]%', @remainingText);
    SET @finalPosition = @finalPosition + @currentSentencePosition;
    SET @remainingText  = RIGHT(@fullText, @textLenght - @finalPosition);

    -- Return first three setences
    RETURN LEFT(@fullText, @finalPosition)

END
GO

The function can now be called as follow with in your query:

UPDATE Product
SET ShortDescription = dbo.GetFirstThreeSentences(FullDescription)
WHERE FullDescription IS NOT NULL;
Edmond Quinton
  • 1,709
  • 9
  • 10