In reality I'd like to detect the coordinates of the "biggest" (both in height and width) burned-in subtitle of a given video source. But in order to do this I first need to detect the box coordinates of every distinct subtitle in the sample video and compare them to find the biggest one. I didn't know where to start with this, so the closest thing I found (sort of) was ffmpeg's bbox video filter, which according to the documentation computes "the bounding box for the non-black pixels in the input frame luminance plane", based on a given luminance value:
ffmpeg -i input.mkv -vf bbox=min_val=130 -f null -
This gives me a line with coordinates for each input frame of the video, e.g.:
[Parsed_bbox_0 @ 0ab734c0] n:123 pts:62976 pts_time:4.1 x1:173 x2:1106 y1:74 y2:694 w:934 h:621 crop=934:621:173:74 drawbox=173:74:934:621
The idea was to write a script that loops through the filter's output, finds the "biggest" box by comparing them all, and outputs its coordinates and frame number as representative of the biggest subtitle.
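The comparison step I have in mind could be sketched roughly like this. It's only a sketch: the function name biggest_box is mine, it assumes the bbox filter logs its per-frame lines to stderr in exactly the format shown above (n:, w:, h: fields), and input.mkv is a placeholder for the actual source.

```shell
# biggest_box: read bbox filter log lines on stdin and print the one
# whose box has the largest area (w * h).
biggest_box() {
  awk '/Parsed_bbox/ {
    # Pick the n:, w: and h: fields out of the log line.
    for (i = 1; i <= NF; i++) {
      split($i, kv, ":")
      if (kv[1] == "w") w = kv[2]
      if (kv[1] == "h") h = kv[2]
    }
    if (w * h > best) { best = w * h; biggest = $0 }
  }
  END { if (biggest != "") print biggest }'
}

# Intended use (bbox writes its lines to stderr, hence 2>&1):
# ffmpeg -i input.mkv -vf bbox=min_val=130 -f null - 2>&1 | biggest_box
```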
The bbox filter though can't properly detect the subtitle box even in a relatively dark video with white hardsubs. By trial and error, and only for the particular video sample I used to run my tests, the "best" result for detecting the box of any subtitle was to use a min_val of 130 (the meaningful values of min_val are presumably in the range 0-255, although the docs don't say). Using the drawbox filter with ffplay to test the coordinates reported for a particular frame, I can see that it correctly detects only the bottom/left/right boundary of the subtitle, presumably because the outline of the globe in the image below is equally bright:
Raising min_val to 230 slightly breaks the previously correct boundaries at the bottom/left/right side:
And raising it to 240 gives me a weird result:
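For completeness, the drawbox check I run at each min_val can be scripted. The drawbox options x/y/w/h/color are documented ffmpeg parameters; the helper name draw_filter and the choice of red are mine, and input.mkv is a placeholder for the actual source.

```shell
# draw_filter: turn reported bbox coordinates into a drawbox filter
# string so a detected box can be inspected visually.
draw_filter() {
  # $1=x1  $2=y1  $3=w  $4=h
  printf 'drawbox=x=%s:y=%s:w=%s:h=%s:color=red' "$1" "$2" "$3" "$4"
}

# Intended use with the example frame above (pts_time 4.1,
# x1:173 y1:74 w:934 h:621), seeking straight to that frame:
# ffplay -ss 4.1 -i input.mkv -vf "$(draw_filter 173 74 934 621)"
```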
However, even if I were able to achieve a perfect outcome with the bbox filter, this technique wouldn't be bulletproof for obvious reasons: the min_val has to be chosen arbitrarily, the burned-in subtitles can be of a different color, the image behind the subtitles can be equally bright or even brighter depending on the video source, etc.
So if possible I would like to know:
- Is there a filter or another technique I can use with ffmpeg to do what I want?
- Is there perhaps another CLI tool or programming library that can achieve this?
- Any hint that could help (perhaps I'm looking at the problem the wrong way)?