Split string on token and aggregate on split words

Question

I have a field that includes files that have 'words' separated by an underscore, _, such as this:

`file_name`
MY_NEW_MOVIE.mov
HD_VIDEO_720p.mov
720p_DISNEY_MOVIE.mov
LG_TYLERPERRY_FEATURE_HD_8CH_EN_L9714343_16X9_235_2398_FINAL_FRSUB.srt

And I want to split on _ and get the count of each word after the split, meaining:

`word`    `count`
MY        1
NEW       1
MOVIE     2
HD        1
VIDEO     1
720p      2
DISNEY    1

Would it be possible/feasible to do this in SQL? So far I have just gotten the perfunctory "remove the file extension", but not sure how I could split on the token and then count that:

select left(file_name, length(file_name) - length(substring_index(file_name, '.', -1))-1) from asset

Additionally,

@Nick -- unfortunately 5.6 -- no json/array stuff I don't believe. — David542, May 14 '20 at 04:43
You'll probably want to take a look at [this question](https://stackoverflow.com/questions/17942508/sql-split-values-to-multiple-rows) which you can adapt to an `_` separated list. You can then use that as a derived table which you can COUNT over. — Nick, May 14 '20 at 04:52
Are there _always_ three components in each filename? Or, could there be more (or less) than three? — Tim Biegeleisen, May 14 '20 at 04:52
@TimBiegeleisen actually good point -- I've provided poor sample data, let me revise that. — David542, May 14 '20 at 05:31

Tim Biegeleisen · Answer 1 · 2020-05-14T05:42:46.970

1

Assuming the filenames always have exactly three components, SUBSTRING_INDEX can get the job done here:

SELECT word, COUNT(*) AS count
FROM
(
    SELECT SUBSTRING_INDEX(file_name, '_', 1) AS word FROM asset
    UNION ALL
    SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(file_name, '_', 2), '_', -1) FROM asset
    UNION ALL
    SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(file_name, '_', -1), '.', 1) FROM asset
) t
GROUP BY word;

Demo

Note: This answer was given based on the OP's original sample data, where all filenames had exactly three underscore-separate components. This answer will not work for the updated question.

edited May 14 '20 at 05:42

answered May 14 '20 at 04:57

Tim Biegeleisen

502,043
27
286
360

great answer, I've updated the question though, unfortunately there can be any number -- actually I think I've seen up to about 18 in a filename. – David542 May 14 '20 at 05:32
Then you are probably going to need to write a stored procedure to solve this. This is not a problem well suited to MySQL. – Tim Biegeleisen May 14 '20 at 05:34
1

yea currently I'm using `collections.Counter()` in python, but it would make things a lot quicker if I could push it all into the DB, but it seems like it's a task not too well suited to mysql. – David542 May 14 '20 at 05:35
See, the problem you have here is that logically a single point of data is one word. Best practice and normalization dictate that therefore each word should be in a separate record. But, you have 18+ words/points in a single record. This is the problem. – Tim Biegeleisen May 14 '20 at 05:36
yea, these are filenames/filepaths though so I can't do much about it -- for example, something that is stored on S3. Actually the point of this whole task is exactly what you mention -- to take the filepaths and to extract the data within it so I can make sense of it all. – David542 May 14 '20 at 05:40
1

I vote for scrubbing this data in your Python script _before_ bringing it into MySQL. – Tim Biegeleisen May 14 '20 at 05:41

score 1 · Accepted Answer · answered May 14 '20 at 06:05

The result you want can be achieved with a query derived from this answer, which uses a generated numbers table along with SUBSTRING_INDEX to split out all the words in each file_name. This is then used as a derived table to count the occurrence of each word. Note the numbers table must have sufficient values to cover the maximum number of words in a filename (12 for this sample data).

SELECT word, COUNT(*) 
FROM (
  SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(LEFT(file_name, LENGTH(file_name)-4), '_', numbers.n), '_', -1) AS word
  FROM (
    select 1 n union all
    select 2 union all select 3 union all select 4 union all
    select 5 union all select 6 union all select 7 union all
    select 8 union all select 9 union all select 10 union all
    select 11 union all select 12
  ) numbers
  JOIN asset ON LENGTH(file_name)
              - LENGTH(REPLACE(file_name, '_', '')) >= numbers.n - 1
) w
GROUP BY word

Output (for your sample data):

word        COUNT(*)
16X9        1
235         1
2398        1
720p        2
8CH         1
DISNEY      1
EN          1
FEATURE     1
FINAL       1
FRSUB       1
HD          2
L9714343    1
LG          1
MOVIE       2
MY          1
NEW         1
TYLERPERRY  1
VIDEO       1

Demo on dbfiddle

oh that's awesome, thank you very much for that! Though after running it, seems it's a bit simpler (and faster -- the non-indexed, on-the-fly join is pretty slow) doing it in python after retrieving the list of files, and also in that case I can add some more complex logic to it, etc. — David542, May 14 '20 at 06:25
@David542 yeah - this is not the sort of thing SQL is good at. I'm not surprised it's faster in python (easier to write too!). But I figured it's good to have something to compare against. — Nick, May 14 '20 at 06:44

Split string on token and aggregate on split words

2 Answers2

Demo