1

I am attempting to use a dialect of SQL to parse a set of records. Specifically I want to count the number records that contain various ids (words) occur in my logs. I also want to count the total number of times these ids are present.

The data is present in two different tables:

  • id_status - A table contain id,status.
  • request_records - A table which contains request records. An id might be present more than once in a given record.

See SQL Fiddle!

id_status

This table contains the ids and their current status.

id, status
sessions, ACTIVE
visits, DEPRECATED
duration, ACTIVE
...

request_records

Sentence record may contain over 300k records:

request
example.com/api?foo=sessions
example.com/api?bar=session%2Cvisits,foo=sessions
example.com/api?bar=duration,visits
example.com/api?foo=sessions
example.com/api?foo=visits,bar=visits
...

Goal

I want to create a table containing four columns id,status,occurence_count,record_count.

  • occurence_count: should be the total number of times an id occurs in all the records.
  • recourd_count: should be the the total number of records a particular id occurs in.

As a result from the above example I would generate the following table:

id, status, occurence_count, recourd_count
sessions,ACTIVE,3,2
visits,DEPRECATED,4,3
duration,ACTIVE,1,1

SQL

Is there a way to construct an SQL query which will return the desired table? I would know how to do it fairly readily in Python or any other language but if possible I'd like to write a SELECT statement to generate this table.

Update: As suggested below, I have attempted to execute the following:

SELECT ids.id, ids.status,
     SUM(length(request) - LENGTH(replace(request, ids.id, ''))) / LENGTH(ids.id) as occurence_count,
     COUNT(reqs.request) AS recourd_count
FROM id_status ids LEFT JOIN
     request_records reqs
     ON find_in_set(ids.id, reqs.requests) > 0
GROUP BY ids.id, ids.status;

However my dialect of SQL does not permit this type of join and returns the following error:

ON clause must be AND of = comparisons of one field name from each table, with all field names prefixed with table name.

Added SQL Fiddle with Sample Schema:

CREATE TABLE id_status
    (`id` varchar(32), `status` varchar(32))
;

INSERT INTO id_status
    (`id`, `status`)
VALUES
    ('sessions', 'ACTIVE'),
    ('visits', 'DEPRECATED'),
    ('duration', 'ACTIVE')
;

CREATE TABLE request_records 
    (`request` varchar(500))
;
Matt
  • 5,028
  • 2
  • 28
  • 55
  • 1
    See http://meta.stackoverflow.com/questions/333952/why-should-i-provide-an-mcve-for-what-seems-to-me-to-be-a-very-simple-sql-query – Strawberry Jan 09 '17 at 17:48
  • Do you want the exact word match? For example in the first record of sentencerecord table, should it count session once or twice? – Karan Shah Jan 09 '17 at 17:55
  • @karan Shah: So, I am looking for the total number of occurrences and the number of records that the word occurs in. so the first record would increment the total count by 2, and the number of record occurrences by 1. – Matt Jan 09 '17 at 18:00
  • I think you got your answer, let me know if you need another look – Karan Shah Jan 09 '17 at 18:08
  • @Strawberry, I have updated the question to include [SQL Fiddle sample Schema](http://sqlfiddle.com/#!9/4349b/2) with the suggested solution failing as well as rephrased the question to be clearer. Please re-open or provide further details on how I could improve the question to be answerable? – Matt Jan 10 '17 at 20:47

2 Answers2

3

You can generate what you want with a query. It will not be efficient:

select w.word, w.status,
       sum(length(sentence) - length(replace(sentence, w.word, ''))) / length(w.word) as cnt,
       count(s.sentence) as num_occurrences
from words w left join
     sentences s
     on find_in_set(w.word, s.sentence) > 0
group by w.word, w.status;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • 1
    I've upvoted this but just fyi, this solution will not work in the case that a "word" contains another "word". For example, if you had a sentence "duration,visits,sessions,new-sessions", the count listed for "sessions" would be greater than it should actually be. If this is a concern consider concatenating a comma to the end of each sentence and then replacing the word+',' with an empty string. You'd have to divide by length(w.word) + 1 in that case. But really if this is a type of query that needs to be repeatedly performed it might be worth restructuring your DB. – VKK Jan 09 '17 at 18:18
  • @VKK . . . Your comment is very true. There is a solution to it for the comma-delimited lists that the OP appears to be using, but it does make the expression much messier. – Gordon Linoff Jan 10 '17 at 01:57
  • So I've added your solution to the question and you can see from the sql fiddle that it does not quite give the answer I was looking for, but thank you for helping. – Matt Jan 10 '17 at 22:27
1

Simple version:

SELECT id,
       status,
       (SELECT (SUM(CHAR_LENGTH(request)) -
                SUM(CHAR_LENGTH(REPLACE(request, id, ''))))
               / CHAR_LENGTH(id) FROM request_records) AS occurrence_count,
       (SELECT COUNT(*)
        FROM request_records
        WHERE INSTR(request, id) > 0) AS record_count
FROM id_status

See updated SQL fiddle.

This just looks for the word being present anywhere so for example would blindly count abcsessions or sessionsxyz when looking for sessions. If something more complex is needed (e.g. considering word boundaries), you might need to use a regular expression replacer rather than the REPLACE function - but unfortunately this is one thing MySQL lacks. I once had a go at writing one but it's not trivial - see here.

Community
  • 1
  • 1
Steve Chambers
  • 37,270
  • 24
  • 156
  • 208
  • 1
    Thanks Steve, this looks promising, I will test it out once I get to work. If I run into issues I might ping you later. I started to get a regular expression working just before leaving work yesterday, but your solution looks much more elegant. – Matt Jan 12 '17 at 12:24
  • No probs. Am guessing you'll already know that MySQL provides the `REGEXP` function that can be easily used for calculating the `record_count` - just do something like `WHERE request REGEXP CONCAT('[[:<:]]', id, '[[:>:]]')` (see [here](http://dev.mysql.com/doc/refman/5.7/en/regexp.html) for more info). It's only the `occurrence_count` where a replace is needed to count the number of occurrences in a single field value. – Steve Chambers Jan 12 '17 at 14:04