How to extract the nth word and count word occurrences in a MySQL string?

Question

I would like to have a mysql query like this:

select <second word in text> word, count(*) from table group by word;

All the regex examples in mysql are used to query if the text matches the expression, but not to extract text out of an expression. Is there such a syntax?

score 45 · Answer 1 · edited Aug 03 '17 at 13:33

45

The following is a proposed solution for the OP's specific problem (extracting the 2nd word of a string), but it should be noted that, as mc0e's answer states, actually extracting regex matches is not supported out-of-the-box in MySQL. If you really need this, then your choices are basically to 1) do it in post-processing on the client, or 2) install a MySQL extension to support it.

BenWells has it very almost correct. Working from his code, here's a slightly adjusted version:

SUBSTRING(
  sentence,
  LOCATE(' ', sentence) + CHAR_LENGTH(' '),
  LOCATE(' ', sentence,
  ( LOCATE(' ', sentence) + 1 ) - ( LOCATE(' ', sentence) + CHAR_LENGTH(' ') )
)

As a working example, I used:

SELECT SUBSTRING(
  sentence,
  LOCATE(' ', sentence) + CHAR_LENGTH(' '),
  LOCATE(' ', sentence,
  ( LOCATE(' ', sentence) + 1 ) - ( LOCATE(' ', sentence) + CHAR_LENGTH(' ') )
) as string
FROM (SELECT 'THIS IS A TEST' AS sentence) temp

This successfully extracts the word IS

edited Aug 03 '17 at 13:33

LinusR

1,149
10
17

answered Oct 26 '10 at 08:27

Brendan Bullen

11,607
1
31
40

12

Yes I missed the sentence reference and the +1 to Locate. I'm not bothered about the 'accepted answer' I just want to be helpful. – BenWells Oct 13 '11 at 11:53
Can make it a bit more generic not just for space, and also add the LENGTH to match the value after the separator, e.g. `LOCATE(' ', sentence) + STRLEN('')` – Noam Jul 10 '16 at 09:17
1

It's almost perfect except it actually returns " IS", which is incorrect. Please see my answer below. – Hypolite Petovan Aug 18 '16 at 19:51
2

I agree with @HypolitePetovan, this answer is slightly incorrect in that it returns 3 chars instead of 2. It is also positioned incorrectly. I have suggested an edit that includes adding in CHAR_LENGTH to correctly position and determine the correct length. Its hard when working with whitespace but running a CHAR_LENGTH on the whole select shows it is returning 3 chars – doz87 Jan 12 '17 at 08:51

score 28 · Answer 2 · edited Mar 23 '16 at 18:44

28

Shorter option to extract the second word in a sentence:

SELECT SUBSTRING_INDEX(SUBSTRING_INDEX('THIS IS A TEST', ' ',  2), ' ', -1) as FoundText

MySQL docs for SUBSTRING_INDEX

edited Mar 23 '16 at 18:44

Umbrella

4,733
2
22
31

answered Sep 07 '12 at 11:51

Damien Goor

289
3
2

score 14 · Answer 3 · edited Aug 11 '16 at 06:26

14

According to http://dev.mysql.com/ the SUBSTRING function uses start position then the length so surely the function for the second word would be:

SUBSTRING(sentence,LOCATE(' ',sentence),(LOCATE(' ',LOCATE(' ',sentence))-LOCATE(' ',sentence)))

edited Aug 11 '16 at 06:26

George G

7,443
12
45
59

answered Oct 26 '10 at 08:02

BenWells

317
1
10

score 7 · Answer 4 · answered Oct 26 '10 at 07:30

No, there isn't a syntax for extracting text using regular expressions. You have to use the ordinary string manipulation functions.

Alternatively select the entire value from the database (or the first n characters if you are worried about too much data transfer) and then use a regular expression on the client.

score 5 · Answer 5 · answered Aug 19 '13 at 06:00

As others have said, mysql does not provide regex tools for extracting sub-strings. That's not to say you can't have them though if you're prepared to extend mysql with user-defined functions:

https://github.com/mysqludf/lib_mysqludf_preg

That may not be much help if you want to distribute your software, being an impediment to installing your software, but for an in-house solution it may be appropriate.

Hypolite Petovan · Answer 6 · 2016-08-26T20:05:57.160

I used Brendan Bullen's answer as a starting point for a similar issue I had which was to retrive the value of a specific field in a JSON string. However, like I commented on his answer, it is not entirely accurate. If your left boundary isn't just a space like in the original question, then the discrepancy increases.

Corrected solution:

SUBSTRING(
    sentence,
    LOCATE(' ', sentence) + 1,
    LOCATE(' ', sentence, (LOCATE(' ', sentence) + 1)) - LOCATE(' ', sentence) - 1
)

The two differences are the +1 in the SUBSTRING index parameter and the -1 in the length parameter.

For a more general solution to "find the first occurence of a string between two provided boundaries":

SUBSTRING(
    haystack,
    LOCATE('<leftBoundary>', haystack) + CHAR_LENGTH('<leftBoundary>'),
    LOCATE(
        '<rightBoundary>',
        haystack,
        LOCATE('<leftBoundary>', haystack) + CHAR_LENGTH('<leftBoundary>')
    )
    - (LOCATE('<leftBoundary>', haystack) + CHAR_LENGTH('<leftBoundary>'))
)

score 2 · Answer 7 · edited Jun 19 '19 at 18:17

2

I don't think such a thing is possible. You can use SUBSTRING function to extract the part you want.

edited Jun 19 '19 at 18:17

shaik moeed

5,300
1
18
54

answered Oct 26 '10 at 07:30

user483085

71
3

Steve Chambers · Answer 8 · 2018-05-21T10:08:23.107

My home-grown regular expression replace function can be used for this.

Demo

See this DB-Fiddle demo, which returns the second word ("I") from a famous sonnet and the number of occurrences of it (1).

SQL

Assuming MySQL 8 or later is being used (to allow use of a Common Table Expression), the following will return the second word and the number of occurrences of it:

WITH cte AS (
     SELECT digits.idx,
            SUBSTRING_INDEX(SUBSTRING_INDEX(words, '~', digits.idx + 1), '~', -1) word
     FROM
     (SELECT reg_replace(UPPER(txt),
                         '[^''’a-zA-Z-]+',
                         '~',
                         TRUE,
                         1,
                         0) AS words
      FROM tbl) delimited
     INNER JOIN
     (SELECT @row := @row + 1 as idx FROM 
      (SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) t1,
      (SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) t2, 
      (SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) t3, 
      (SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) t4, 
      (SELECT @row := -1) t5) digits
     ON LENGTH(REPLACE(words, '~' , '')) <= LENGTH(words) - digits.idx)
SELECT c.word,
       subq.occurrences
FROM cte c
LEFT JOIN (
  SELECT word,
         COUNT(*) AS occurrences
  FROM cte
  GROUP BY word
) subq
ON c.word = subq.word
WHERE idx = 1; /* idx is zero-based so 1 here gets the second word */

Explanation

A few tricks are used in the SQL above and some accreditation is needed. Firstly the regular expression replacer is used to replace all continuous blocks of non-word characters - each being replaced by a single tilda (~) character. Note: A different character could be chosen instead if there is any possibility of a tilda appearing in the text.

The technique from this answer is then used for transforming a string with delimited values into separate row values. It's combined with the clever technique from this answer for generating a table consisting of a sequence of incrementing numbers: 0 - 10,000 in this case.

score -2 · Answer 9 · edited Nov 01 '12 at 00:07

-2

The field's value is:

 "- DE-HEB 20% - DTopTen 1.2%"
SELECT ....
SUBSTRING_INDEX(SUBSTRING_INDEX(DesctosAplicados, 'DE-HEB ',  -1), '-', 1) DE-HEB ,
SUBSTRING_INDEX(SUBSTRING_INDEX(DesctosAplicados, 'DTopTen ',  -1), '-', 1) DTopTen ,

FROM TABLA

Result is:

  DE-HEB       DTopTEn
    20%          1.2%

edited Nov 01 '12 at 00:07

jonsca

10,218
26
54
62

answered Oct 31 '12 at 23:44

Antonio Rivera

19

How to extract the nth word and count word occurrences in a MySQL string?

9 Answers9

Linked

Related