-1

I was wondering, how can I write a query that counts the number of words in a sentence, not including one-word special characters / punctuations?

for example:

sentence numofwords
'Hello world, how are you?' 5
'Me & my friend are here' 5
'I think it's great' 4
'Who # is # that?' 3
'Oh , didn't see you there!' 5

I got to this point:

with cte as(
    select sentence,
           replace(sentence, '&', '') as nosymbolname
    from database
)
select sentence,
       (length(nosymbolname) - length(replace(nosymbolname, ' ', ''))) + 1 wordsinname
from cte

but it does not apply for cases that have one-word special characters/ punctuations, and I can't really think of a way to address those scenarios.

Thanks!

lemon
  • 14,875
  • 6
  • 18
  • 38
Ben Kitai
  • 9
  • 2
  • 1
    You need to define the characters that can compose a word, then stick to your definition while you attempt to solve this problem (and accept consequent tradeoffs). If there's no such definition, there may not be a "fully-valid" solution to this problem, due to the complexity of natural language. – lemon Jan 28 '23 at 18:29
  • What is precise MySQL version? – Akina Jan 28 '23 at 19:47
  • Mysql is not really suitable for this task as its text manipulation and regexp functions have limited capabilities. I suggest using python or r that have existing libraries for this purpose instead of trying to reinvent the wheel. Not to mention the fact that didn't stands for did not, so should count as two words, not just one. – Shadow Jan 28 '23 at 19:53

1 Answers1

0
SELECT sentence,
       SUM(one_word REGEXP '\\w') cnt
FROM test
CROSS JOIN JSON_TABLE(CONCAT('["', REPLACE(sentence, ' ', '","'), '"]'),
                      '$[*]' COLUMNS (one_word TEXT PATH '$')) jsontable
GROUP BY 1

fiddle

Adjust REGEXP pattern if needed.

Akina
  • 39,301
  • 5
  • 14
  • 25