1

Using MySQL, I'm trying to count the number of occurrences of a substring in a column, for each row.

In the example table below, the string "art" appears three times in the Terms column, so it would have a count of 3.

Example Table:

TERMS
art
artistic
painting
elephant
art deco
paint
paintings

Desired Output:

TERMS      COUNT
art        3
artistic   1
painting   2
elephant   1
art deco   1
paint      2
paintings  1

Edits:

As a starting point, I know that:

SELECT terms, COUNT(*)
FROM table
GROUP BY terms

will output number of occurrences for each full term string. For substring matching, I think this may involve a subquery.

Tried the following, but every count was 1.

SELECT terms, ROUND((CHAR_LENGTH(terms) - CHAR_LENGTH(REPLACE(terms, terms, ""))) / CHAR_LENGTH(terms)) AS count
FROM table
GROUP BY terms
OBreaux1
  • 13
  • 1
  • 4

1 Answers1

3

I would write this by first writing a query that just returns a list of terms we want to return. For example:

 SELECT t.terms
   FROM `table` t
  GROUP BY t.terms

Then wrap that in parens and use it as an inline view...

SELECT w.terms
  FROM ( SELECT t.terms
           FROM `table` t
          GROUP BY t.terms
       ) w
 ORDER BY w.terms

With that, we can do a join operation to look for matching rows, and get a count. Assuming a guarantee that terms doesn't contain underscore (_) or percent (%) characters, we can use a LIKE comparison.

Given that every term in our list is going to appear at least one time, we can use an inner join. In the more general case, where we might expect to return a zero count, we would use an outer join.

SELECT w.terms
     , COUNT(1) AS `COUNT`
  FROM ( SELECT t.terms
           FROM `table` t
          GROUP BY t.terms
       ) w
  JOIN `table` c
    ON c.terms LIKE CONCAT('%', w.terms ,'%')
 GROUP BY w.terms
 ORDER BY w.terms

In the LIKE comparison, the percent signs are wildcards that match any characters (zero, one or more).

If there's a possibility that terms does contain underscore or percent characters, we can escape those so they aren't considered wildcards by the LIKE comparison. An expression like this should do the trick:

 REPLACE(REPLACE( w.terms ,'_','\_'),'%','\%')

So we'd have a query like this:

SELECT w.terms
     , COUNT(1) AS `COUNT`
  FROM ( SELECT t.terms
           FROM `table` t
          GROUP BY t.terms
       ) w
  JOIN `table` c
    ON c.terms LIKE CONCAT('%',REPLACE(REPLACE( w.terms ,'_','\_'),'%','\%'),'%')
 GROUP BY w.terms
 ORDER BY w.terms

There are other query patterns that will return the specified result. This is just a demonstration of one approach.


NOTE: In the example in the question, every terms that is a substring of another terms, the substring match appears at the beginning of the terms. This query will also find matches where the term isn't at the beginning.

e.g. dartboard would be considered a match to art

The query could be modified to match terms that appear only at the beginning of other terms.

FOLLOWUP

With the example data, returns:

terms      COUNT     matched_terms
---------  --------  -------------------------
art               3  art,art deco,artistic
art deco          1  art deco
artistic          1  artistic
elephant          1  elephant
paint             3  paint,painting,paintings
painting          2  painting,paintings
paintings         1  paintings

In addition to the COUNT(1) aggregate, I also included another expression in the select list. This isn't required, but it does give some additional information about which terms were considered to be matches.

 GROUP_CONCAT(DISTINCT c.terms ORDER BY c.terms) AS `matched_terms`

NOTE: If there's a possibility that terms contains backslash characters, we can escape those characters as well, using another REPLACE

 REPLACE(REPLACE(REPLACE( w.terms ,'\\','\\\\'),'_','\_'),'%','\%')
                 ^^^^^^^^         ^^^^^^^^^^^^^
spencer7593
  • 106,611
  • 15
  • 112
  • 140
  • 1
    Very nice answer! but dont you need a `GROUP BY w.term` also before `ORDER BY`? – Michael Krikorev Mar 08 '17 at 23:35
  • 1
    @MichaelK: yes. we do need a GROUP BY. *DOH!* (answer edited to incorporate the correction.) – spencer7593 Mar 08 '17 at 23:37
  • Incredible answer! Thank you. Not only did this accomplish my goal, but I learned a lot while studying your answer. I ran it on a table with 26k rows, and the query took 175s to complete. – OBreaux1 Mar 09 '17 at 00:01
  • @OBreaux1: The LIKE comparison with the percent wildcards is going to require that every row in `table` be examined, for every `terms` entry. If those are longish rows in the table (lots more columns in the table) an index ... ON `table (terms)` may benefit performance, if the query is satisfied entirely from the index. – spencer7593 Mar 09 '17 at 01:06
  • 1
    @OBreaux1: This query returns the number of *rows* that include a `terms`; the count is increment by 1 if there is a match. For example, "dartmart" would match "art", but the count would be incremented by one. Not two, even though "art" appears two times. – spencer7593 Mar 09 '17 at 01:11