Given the following string:
00:00:03:13 9420 9454 5bcb 45d9 c24f c152 c449 cec7 94f2 c1ce c420 434f cdd0 d554 4552 20ce 4f49 d345 d35d
How would I convert this to text?
I found a library called pycaption that can read this format. Install it with pip install pycaption
and try to parse your sample:
from pycaption import SCCReader
input = '00:00:03:13 9420 9454 5bcb 45d9 c24f c152 c449 cec7 94f2 c1ce c420 434f cdd0 d554 4552 20ce 4f49 d345 d35d'
# On Python 3 the string is already text; on Python 2 pass input.decode('utf-8') instead
contents = SCCReader().read(input)
contents.get_captions('en-US')
You will get an error:
pycaption.exceptions.CaptionReadNoCaptions: CaptionReadNoCaptions((u'empty caption file',))
That's because an SCC file doesn't contain only encoded text; it also contains commands. The first two bytes, 9420,
mean "start pop-on caption". A trailing 942f
(End Of Caption) is expected in order to display the text, but it's missing from your sample. I assume it's in a later part of the file that you didn't paste. Also, the first line of an SCC file should be the format header. Let's add those two lines to your sample:
input = '''Scenarist_SCC V1.0
00:00:03:13 9420 9454 5bcb 45d9 c24f c152 c449 cec7 94f2 c1ce c420 434f cdd0 d554 4552 20ce 4f49 d345 d35d
00:00:04:00 942f
'''
Then the output would be:
[u'00:00:04.037 --> 00:00:00.000\n[KEYBOARDING\nAND COMPUTER NOISES]']
To obtain plain text, you need to convert hex codes to letters.
Example: 5bcb 45d9 >>translates to>> "[KEY"
5b - "["
cb - "K"
45 - "E"
d9 - "Y"
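The mapping above works because each byte carries an odd-parity bit in its highest position; clearing that bit leaves a 7-bit code that matches ASCII for ordinary letters and brackets. A minimal sketch of that step (the helper names are made up for illustration):

```python
def has_odd_parity(byte_value):
    """Return True if the byte has an odd number of 1 bits."""
    return bin(byte_value).count("1") % 2 == 1


def strip_parity(byte_value):
    """Clear the top (parity) bit, leaving the 7-bit character code."""
    return byte_value & 0x7F


raw = bytes.fromhex("5bcb45d9")

# Every byte in the sample pair passes the odd-parity check.
parity_ok = all(has_odd_parity(b) for b in raw)

text = "".join(chr(strip_parity(b)) for b in raw)
print(parity_ok, text)
```

For example, cb is 0x4b once the top bit is cleared, which is "K".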
There are also control codes, such as 9420, that are not text.
To find the translation table for letters, check here: https://github.com/mantas-done/subtitles/blob/v1.0.14/src/Code/Converters/SccConverter.php#L843
A bit higher up in the same file there is a table of control codes, which you can skip when extracting text.
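If you only need a rough plain-text dump and don't want to carry a full lookup table, the steps above can be sketched in Python. This is an approximation under several assumptions, not a complete decoder: decode_scc_line is a hypothetical helper name, characters are treated as plain ASCII after clearing the parity bit (the caption character set differs from ASCII at a few code points, such as accented letters), any pair whose first byte falls in 0x10-0x1f is treated as a control code and skipped, a control pair whose second byte is 0x40 or higher is assumed to reposition the cursor and is turned into a line break, and doubled control codes are not deduplicated.

```python
def decode_scc_line(line):
    """Roughly decode one SCC data line (timecode + hex words) to text."""
    _timecode, *words = line.split()
    out = []
    for word in words:
        # Each word is two bytes; clear the parity bit on both.
        first = int(word[:2], 16) & 0x7F
        second = int(word[2:], 16) & 0x7F
        if 0x10 <= first <= 0x1F:
            # Control pair: not text. Assumption: a second byte >= 0x40
            # moves to another row, so emit a line break between rows.
            if second >= 0x40 and out:
                out.append("\n")
            continue
        for code in (first, second):
            if code >= 0x20:  # skip padding (0x00) and other non-printables
                out.append(chr(code))
    return "".join(out)


sample = ("00:00:03:13 9420 9454 5bcb 45d9 c24f c152 c449 cec7 94f2 "
          "c1ce c420 434f cdd0 d554 4552 20ce 4f49 d345 d35d")
decoded = decode_scc_line(sample)
print(decoded)
```

On the sample line this prints [KEYBOARDING on one line and AND COMPUTER NOISES] on the next. For anything beyond a quick look at the text, the translation tables linked above are the safer route.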