12

I'd like to make an SQL query where the condition is that column1 contains three or more words. Is there something to do that?

Bob
  • 10,741
  • 27
  • 89
  • 143
  • 4
    Yes, you should first decide on the database you are using and tag your question appropriately. MySQL? SQL Server? Postgres? Something else. – Gordon Linoff Jan 07 '15 at 19:44
  • http://dev.mysql.com/doc/refman/5.0/en/string-functions.html – Marc B Jan 07 '15 at 19:44
  • 1
    maybe use a regular expression - then count the matches – Randy Jan 07 '15 at 19:46
  • 1
    possible duplicate of [Using SQL to determine word count stats of a text field](http://stackoverflow.com/questions/748276/using-sql-to-determine-word-count-stats-of-a-text-field) – Mike G Jan 07 '15 at 19:50
  • You will not find a single SQL statement which does this reliably on **all** mentioned DBMS –  Jan 08 '15 at 07:35

10 Answers10

32

maybe try counting spaces ?

SELECT * 
FROM table
WHERE (LENGTH(column1) - LENGTH(replace(column1, ' ', ''))) > 1

and assume words is number of spaces + 1

Steve
  • 8,153
  • 9
  • 44
  • 91
David Chan
  • 7,347
  • 1
  • 28
  • 49
5

If you want a condition that a column contains three or more words and you want it to work in a bunch of databases and we assume that words are separated by single spaces, then you can use like:

where column1 like '% % %'
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • that would match a column containing only two spaces `where column1 like '%_ %_ %_'` is only slightly better. – Jasen Jan 07 '15 at 20:04
  • @Jasen . . . Read the answer. The assumption is clearly stated that words are separated by a single space. A string consisting only of spaces would not meet that condition. – Gordon Linoff Jan 07 '15 at 20:06
  • I must admit that your answer is the only one that will execute in all three databases. – Jasen Jan 07 '15 at 20:25
5

I think David nailed it above. However, as a more complete answer:

LENGTH(RTRIM(LTRIM(REPLACE(column1,'  ', ' ')))) - LENGTH(REPLACE(RTRIM(LTRIM(REPLACE(column1, '  ', ' '))), ' ', '')) + 1 AS number_of_words

This will remove double spaces, as well as leading and trailing spaces in your string.

Of course, you may go further by adding replacements for more than 2 spaces in a row...

3

In Postgres you can use regexp_split_to_array() for this:

select *
from the_table
where array_length(regexp_split_to_array(the_column, '\s+'), 1) >= 3;

This will split the contents of the column the_column into array elements. One ore more whitespace are used as the delimiter. It won't respect "quoted" spaces though. The value 'one "two three" four' will be counted as four words.

  • Can you explain to me in detail what that "1" flag at the end of regexp_split_to_array does? I can't find the documentation on it – Tom Dec 14 '16 at 20:40
  • @Tom: that's a parameter for `array_length()` not for `regexp_split_to_array()` –  Dec 14 '16 at 20:43
3

The best way to do this, is to NOT do this.

Instead, you should use the application layer to count the words during INSERT and save the word count into its own column.

While I like, and upvoted, some of the answers here, all of them will be very slow and not 100% accurate.

I know people want a simple answer to SELECT the word count, but it just is NOT POSSIBLE with accuracy and speed.

If you want it to be 100% accurate, and very fast, then use this solution.

Steps to solve:

  1. Add a column to your table and index it: ALTER TABLE tablename ADD COLUMN wordcount INT UNSIGNED NULL, ADD INDEX idxtablename_count (wordcount ASC);.
  2. Before doing your INSERT, count the number of words using your application. For example in PHP: $count = str_word_count($somevalue);
  3. During the INSERT, include the value of $count for the column wordcount like insert into tablename (col1, col2, col3, wordcount) values (val1, val2, val3, $count);

Then your select statement becomes super easy, clean, uber-fast, and 100% accurate.

select * from tablename where wordcount >= 3;

Also remember when you are updating any rows that you will need to recount the words for that column.

Evan de la Cruz
  • 1,966
  • 1
  • 13
  • 17
  • This is the best answer because it is the only one that is 100% accurate and does not do a full table scan (and so it is fast, unlike the other solutions) – Evan de la Cruz Mar 07 '16 at 23:20
  • 8
    This is not the best answer because it does not answer the question. – Ken Johnson Oct 28 '16 at 20:03
  • @KenJohnson "I'd like to make an SQL query where the condition is that column1 contains three or more words. Is there something to do that?" was the question and `select * from tablename where wordcount >= 3;` is my answer, if you want to be semantically accurate. OP asked for a query. OP got a query. – Evan de la Cruz Oct 28 '16 at 23:59
  • 1
    @KenJohnson, I'm a little late here, but sometimes a question can be answered by asking a different question and answering that one. In my experience, some hard problems become easy when adding a layer of abstraction or solving the problem at a different stage in the code's lifecycle. – Jared Menard May 15 '19 at 19:35
  • 1
    @JaredMenard I totally agree with that sentiment, and i believe that this answer has a place in this thread. I merely contest the statement that this is the "best answer" since it answers it in a roundabout way, with contrived constraints (albeit constraints that others may identify with). – Ken Johnson May 16 '19 at 21:23
  • @KenJohnson Then "best" without further specification is a subjective term. – sigma Oct 31 '22 at 13:04
1

For "n" or more words

select *
from table
where (length(column)- length(replace(column, " ", "")) + 1) >= n

PS: This would not work if words have multiple spaces between them.

Trideep Rath
  • 3,623
  • 1
  • 25
  • 14
1

With ClickHouse DB You can use splitByWhitespace() function.

Refer : https://clickhouse.com/docs/en/sql-reference/functions/splitting-merging-functions#splitbywhitespaces

  • Your answer could be improved by adding an example of query (relevant to the question) using this function in the answer itself. – YurkoFlisk Aug 08 '22 at 02:28
  • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - [From Review](/review/late-answers/32405542) – Webdeveloper_Jelle Aug 09 '22 at 11:07
1

None of the other answers seem to take multiple spaces into account. For example, a lot of people use two spaces between sentences; these space-counters would count an extra word per sentence. "Also, scenarios such as spaces around a hyphen - like that. "

For my purposes, this was far more accurate:

SELECT 
  LENGTH(REGEXP_REPLACE(myText, '[ \n\t\|\-]{1,}',' ')) - 
  LENGTH(REGEXP_REPLACE(myText, '[ \n\t\|\-]{1,}', '')) wordCount FROM myTable;

It counts any sets of 1 or more consecutive characters from any of: [space, linefeed, tab, pipe, or hyphen] and counts it as one word.

ashleedawg
  • 20,365
  • 9
  • 72
  • 105
0

This can work:

SUM(LENGTH(a) - LENGTH(REPLACE(a, ' ', '')) + 1)

Where a is the string column. It will count the number of spaces, which is 1 less than the number of words.

Shahar
  • 1,687
  • 2
  • 12
  • 18
-2

To handle multiple spaces too, use the method shown here

Declare @s varchar(100)
set @s='  See      how many                        words this      has  '
set @s=ltrim(rtrim(@s))

while charindex('  ',@s)>0
Begin
    set @s=replace(@s,'  ',' ')
end

select len(@s)-len(replace(@s,' ',''))+1 as word_count

https://exploresql.com/2018/07/31/how-to-count-number-of-words-in-a-sentence/

Madhivanan
  • 13,470
  • 1
  • 24
  • 29