Leading spaces, trailing spaces, two or more spaces between the neighbouring words – these are the likely causes of the wrong results you are getting.
The functions LTRIM()
and RTRIM()
can help you eliminate the first two issues. As for the third one, you can use REPLACE(ExtractedText, ' ', ' ')
to replace double spaces with single ones, but I'm not sure if you do not have triple ones (in which case you'd need to repeat the replacing).
UPDATE
Here's a UDF that uses CTEs and ranking to eliminate extra spaces and then counts the remaining ones to return the quantity as the number of words:
CREATE FUNCTION fnCountWords (@Str varchar(max))
RETURNS int
AS BEGIN
DECLARE @xml xml, @res int;
SET @Str = RTRIM(LTRIM(@Str));
WITH split AS (
SELECT
idx = number,
chr = SUBSTRING(@Str, number, 1)
FROM master..spt_values
WHERE type = 'P'
AND number BETWEEN 1 AND LEN(@Str)
),
ranked AS (
SELECT
idx,
chr,
rnk = idx - ROW_NUMBER() OVER (PARTITION BY chr ORDER BY idx)
FROM split
)
SELECT @res = COUNT(DISTINCT rnk) + 1
FROM ranked
WHERE chr = ' ';
RETURN @res;
END
With this function your query will be simply like this:
SELECT fnCountWords(ExtractedText)
FROM EDDSDBO.Document
WHERE ID='100'
UPDATE 2
The function uses one of the system tables, master..spt_values
, as a tally table. The particular subset used contains only values from 0 to 2047. This means the function will not work correctly for inputs longer than 2047 characters (after trimming both leading and trailing spaces), as @t-clausen.dk has correctly noted in his comment. Therefore, a custom tally table should be used if longer input strings are possible.