SQL: problem word count with len()

Question

I am trying to count words of text that is written in a column of table. Therefor I am using the following query.

SELECT LEN(ExtractedText) - 
LEN(REPLACE(ExtractedText, ' ', '')) + 1 from EDDSDBO.Document where ID='100'.

I receive a wrong result that is much to high. On the other hand, if I copy the text directly into the statement then it works, i.e.

SELECT LEN('blablabla text') - LEN(REPLACE('blablabla text', ' ', '')) + 1.

Now the datatype is nvarchar(max) since the text is very long. I have already tried to convert the column into text or ntext and to apply datalength() instead of len(). Nevertheless I obtain the same result that it does work as a string but does not work from a table.

Do a `SELECT ExtractedText FROM EDDSDBO.Document WHERE ID='100'` make sure the string is what you expect it to be. — nobody, Aug 11 '11 at 07:55

score 2 · Answer 1 · answered Aug 11 '11 at 07:52

You're counting spaces not words. That will typically yield an approximate answer.

e.g.

'  this     string    will    give an    incorrect      result   '

Try this approach: http://www.sql-server-helper.com/functions/count-words.aspx

CREATE FUNCTION [dbo].[WordCount] ( @InputString VARCHAR(4000) ) 
RETURNS INT
AS
BEGIN

DECLARE @Index          INT
DECLARE @Char           CHAR(1)
DECLARE @PrevChar       CHAR(1)
DECLARE @WordCount      INT

SET @Index = 1
SET @WordCount = 0

WHILE @Index <= LEN(@InputString)
BEGIN
    SET @Char     = SUBSTRING(@InputString, @Index, 1)
    SET @PrevChar = CASE WHEN @Index = 1 THEN ' '
                         ELSE SUBSTRING(@InputString, @Index - 1, 1)
                    END

    IF @PrevChar = ' ' AND @Char != ' '
        SET @WordCount = @WordCount + 1

    SET @Index = @Index + 1
END

RETURN @WordCount

END
GO

usage

DECLARE @String VARCHAR(4000)
SET @String = 'Health Insurance is an insurance against expenses incurred through illness of the insured.'

SELECT [dbo].[WordCount] ( @String )

score 1 · Answer 2 · edited May 23 '17 at 12:09

Leading spaces, trailing spaces, two or more spaces between the neighbouring words – these are the likely causes of the wrong results you are getting.

The functions LTRIM() and RTRIM() can help you eliminate the first two issues. As for the third one, you can use REPLACE(ExtractedText, ' ', ' ') to replace double spaces with single ones, but I'm not sure if you do not have triple ones (in which case you'd need to repeat the replacing).

UPDATE

Here's a UDF that uses CTEs and ranking to eliminate extra spaces and then counts the remaining ones to return the quantity as the number of words:

CREATE FUNCTION fnCountWords (@Str varchar(max))
RETURNS int
AS BEGIN
  DECLARE @xml xml, @res int;
  SET @Str = RTRIM(LTRIM(@Str));
  WITH split AS (
    SELECT
      idx = number,
      chr = SUBSTRING(@Str, number, 1)
    FROM master..spt_values
    WHERE type = 'P'
      AND number BETWEEN 1 AND LEN(@Str)
  ),
  ranked AS (
    SELECT
      idx,
      chr,
        rnk = idx - ROW_NUMBER() OVER (PARTITION BY chr ORDER BY idx)
      FROM split
  )
  SELECT @res = COUNT(DISTINCT rnk) + 1
  FROM ranked
  WHERE chr = ' ';
  RETURN @res;
END

With this function your query will be simply like this:

SELECT fnCountWords(ExtractedText)
FROM EDDSDBO.Document
WHERE ID='100'

UPDATE 2

The function uses one of the system tables, master..spt_values, as a tally table. The particular subset used contains only values from 0 to 2047. This means the function will not work correctly for inputs longer than 2047 characters (after trimming both leading and trailing spaces), as @t-clausen.dk has correctly noted in his comment. Therefore, a custom tally table should be used if longer input strings are possible.

nvarchar(max) is very long, you could run into problems when there are more than 2047 words — t-clausen.dk, Aug 11 '11 at 09:20

t-clausen.dk · Answer 3 · 2011-08-11T12:08:50.510

Replace the spaces with something that never occur in your text like ' $!' or pick another value. then replace all '$! ' and '$!' with nothing this way you never have more than 1 space after a word. Then use your current script. I have defined a word as a space followed by a non-space.

This is an example

DECLARE @T TABLE(COL1 NVARCHAR(2000), ID INT)

INSERT @T VALUES('A B  C   D', 100)

SELECT LEN(C) - LEN(REPLACE(C,' ', '')) COUNT FROM (
SELECT REPLACE(REPLACE(REPLACE(' ' + COL1, ' ', ' $!'), '$! ',''), '$!', '') C
FROM @T ) A

Here is a recursive solution

DECLARE @T TABLE(COL1 NVARCHAR(2000), ID INT)

INSERT @T VALUES('A B  C   D', 100)
INSERT @T VALUES('have a nice day with 7 words', 100)

;WITH CTE AS
(
SELECT 1 words, col1 c, col1 FROM @t WHERE id = 100
UNION ALL
SELECT words +1, right(c, len(c) - patindex('% [^ ]%', c)), col1 FROM cte
WHERE patindex('% [^ ]%', c) > 0
)
SELECT words, col1 FROM cte WHERE patindex('% [^ ]%', c) = 0

score 0 · Answer 4 · edited Apr 20 '17 at 11:29

0

You should declare the column using the varchar data type, like:

create table emp(ename varchar(22));

insert into emp values('amit');

select ename,len(ename) from emp;

output : 4

edited Apr 20 '17 at 11:29

Athafoud

2,898
3
40
58

answered Aug 11 '11 at 08:49

Mohammad Jahangeer Ansari

949
2
7
10

I think you misunderstand the question, as I understand it, the OP is interested in counting words – t-clausen.dk Aug 11 '11 at 09:32

SQL: problem word count with len()

4 Answers4