1

I have a field that includes files that have 'words' separated by an underscore, _, such as this:

`file_name`
MY_NEW_MOVIE.mov
HD_VIDEO_720p.mov
720p_DISNEY_MOVIE.mov
LG_TYLERPERRY_FEATURE_HD_8CH_EN_L9714343_16X9_235_2398_FINAL_FRSUB.srt

And I want to split on _ and get the count of each word after the split, meaining:

`word`    `count`
MY        1
NEW       1
MOVIE     2
HD        1
VIDEO     1
720p      2
DISNEY    1

Would it be possible/feasible to do this in SQL? So far I have just gotten the perfunctory "remove the file extension", but not sure how I could split on the token and then count that:

select left(file_name, length(file_name) - length(substring_index(file_name, '.', -1))-1) from asset

Additionally,

David542
  • 104,438
  • 178
  • 489
  • 842
  • Which version of MySQL? – Nick May 14 '20 at 04:41
  • @Nick -- unfortunately 5.6 -- no json/array stuff I don't believe. – David542 May 14 '20 at 04:43
  • You'll probably want to take a look at [this question](https://stackoverflow.com/questions/17942508/sql-split-values-to-multiple-rows) which you can adapt to an `_` separated list. You can then use that as a derived table which you can COUNT over. – Nick May 14 '20 at 04:52
  • Are there _always_ three components in each filename? Or, could there be more (or less) than three? – Tim Biegeleisen May 14 '20 at 04:52
  • @TimBiegeleisen actually good point -- I've provided poor sample data, let me revise that. – David542 May 14 '20 at 05:31

2 Answers2

1

Assuming the filenames always have exactly three components, SUBSTRING_INDEX can get the job done here:

SELECT word, COUNT(*) AS count
FROM
(
    SELECT SUBSTRING_INDEX(file_name, '_', 1) AS word FROM asset
    UNION ALL
    SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(file_name, '_', 2), '_', -1) FROM asset
    UNION ALL
    SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(file_name, '_', -1), '.', 1) FROM asset
) t
GROUP BY word;

Demo

Note: This answer was given based on the OP's original sample data, where all filenames had exactly three underscore-separate components. This answer will not work for the updated question.

Tim Biegeleisen
  • 502,043
  • 27
  • 286
  • 360
  • great answer, I've updated the question though, unfortunately there can be any number -- actually I think I've seen up to about 18 in a filename. – David542 May 14 '20 at 05:32
  • Then you are probably going to need to write a stored procedure to solve this. This is not a problem well suited to MySQL. – Tim Biegeleisen May 14 '20 at 05:34
  • 1
    yea currently I'm using `collections.Counter()` in python, but it would make things a lot quicker if I could push it all into the DB, but it seems like it's a task not too well suited to mysql. – David542 May 14 '20 at 05:35
  • See, the problem you have here is that logically a single point of data is one word. Best practice and normalization dictate that therefore each word should be in a separate record. But, you have 18+ words/points in a single record. This is the problem. – Tim Biegeleisen May 14 '20 at 05:36
  • yea, these are filenames/filepaths though so I can't do much about it -- for example, something that is stored on S3. Actually the point of this whole task is exactly what you mention -- to take the filepaths and to extract the data within it so I can make sense of it all. – David542 May 14 '20 at 05:40
  • 1
    I vote for scrubbing this data in your Python script _before_ bringing it into MySQL. – Tim Biegeleisen May 14 '20 at 05:41
1

The result you want can be achieved with a query derived from this answer, which uses a generated numbers table along with SUBSTRING_INDEX to split out all the words in each file_name. This is then used as a derived table to count the occurrence of each word. Note the numbers table must have sufficient values to cover the maximum number of words in a filename (12 for this sample data).

SELECT word, COUNT(*) 
FROM (
  SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(LEFT(file_name, LENGTH(file_name)-4), '_', numbers.n), '_', -1) AS word
  FROM (
    select 1 n union all
    select 2 union all select 3 union all select 4 union all
    select 5 union all select 6 union all select 7 union all
    select 8 union all select 9 union all select 10 union all
    select 11 union all select 12
  ) numbers
  JOIN asset ON LENGTH(file_name)
              - LENGTH(REPLACE(file_name, '_', '')) >= numbers.n - 1
) w
GROUP BY word

Output (for your sample data):

word        COUNT(*)
16X9        1
235         1
2398        1
720p        2
8CH         1
DISNEY      1
EN          1
FEATURE     1
FINAL       1
FRSUB       1
HD          2
L9714343    1
LG          1
MOVIE       2
MY          1
NEW         1
TYLERPERRY  1
VIDEO       1

Demo on dbfiddle

Nick
  • 138,499
  • 22
  • 57
  • 95
  • oh that's awesome, thank you very much for that! Though after running it, seems it's a bit simpler (and faster -- the non-indexed, on-the-fly join is pretty slow) doing it in python after retrieving the list of files, and also in that case I can add some more complex logic to it, etc. – David542 May 14 '20 at 06:25
  • @David542 yeah - this is not the sort of thing SQL is good at. I'm not surprised it's faster in python (easier to write too!). But I figured it's good to have something to compare against. – Nick May 14 '20 at 06:44