Convert SCC (two-byte hexadecimal words) to string

Question

Given the following string:

00:00:03:13 9420 9454 5bcb 45d9 c24f c152 c449 cec7 94f2 c1ce c420 434f cdd0 d554 4552 20ce 4f49 d345 d35d

How would I convert this to text?

Any particular programming language? If not, then the `language-agnostic` tag would probably be appropriate. — Paul R, Jan 26 '15 at 23:32
@PaulR any language would work here -- my preference would be for python. — David542, Jan 26 '15 at 23:57
Perhaps with something like [this](http://stackoverflow.com/a/4296727/355230)? — martineau, Jan 27 '15 at 00:02
@martineau here's some more info: http://www.theneitherworld.com/mcpoodle/SCC_TOOLS/DOCS/SCC_FORMAT.HTML — David542, Jan 27 '15 at 00:20
Well, from what that says, the data is made up of 2-byte commands and single-byte characters. It's possible someone has already implemented a decoder for this, so you should look around for a Python module that handles it. Otherwise you're going to need to use the format specification and write something yourself to do it. From what I read, it sounds completely feasible to do in Python -- and probably not overly difficult given the detailed format information available. — martineau, Jan 27 '15 at 01:59
BTW It doesn't sound like the text is unicode, more like iso-8859-1 to me. — martineau, Jan 27 '15 at 02:18

score 2 · Accepted Answer · answered Jan 27 '15 at 03:12

pycaption is a library I found. Run `pip install pycaption` and try to parse your sample:

from pycaption import SCCReader
input = '00:00:03:13 9420 9454 5bcb 45d9 c24f c152 c449 cec7 94f2 c1ce c420 434f cdd0 d554 4552 20ce 4f49 d345 d35d'
# Python 2: decode the byte string to unicode before parsing (on Python 3, pass the str as-is)
contents = SCCReader().read(input.decode('utf-8'))
contents.get_captions('en-US')

You will get an error:

pycaption.exceptions.CaptionReadNoCaptions: CaptionReadNoCaptions((u'empty caption file',))

That's because SCC contains not only encoded text but also commands. The first two bytes, 9420, mean "start pop-on caption". A trailing 942f (End Of Caption) is expected in order to display the text, but it's missing from your sample; I assume it's in the following part that you didn't paste. In addition, the first line of an SCC file should be the format version header. Let's add two lines to your sample:

input = '''Scenarist_SCC V1.0

00:00:03:13 9420 9454 5bcb 45d9 c24f c152 c449 cec7 94f2 c1ce c420 434f cdd0 d554 4552 20ce 4f49 d345 d35d

00:00:04:00 942f
'''

Then the output would be:

[u'00:00:04.037 --> 00:00:00.000\n[KEYBOARDING\nAND COMPUTER NOISES]']
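The two things the original sample lacked, the header line and an End Of Caption command, can be checked before handing the text to a parser. This is only a minimal sketch: `missing_scc_parts` is a hypothetical helper written for illustration, not part of pycaption, and it assumes that every non-header line starts with a timecode followed by four-digit hex words.

```python
HEADER = "Scenarist_SCC V1.0"


def missing_scc_parts(scc_text):
    """List what a pop-on SCC snippet lacks: the header and/or 942f."""
    lines = [ln.strip() for ln in scc_text.splitlines() if ln.strip()]
    missing = []

    # The first non-blank line should be the format header.
    if not lines or lines[0] != HEADER:
        missing.append("header")

    # Collect the hex words of every data line (the first token is the timecode).
    words = []
    for ln in lines:
        if ln != HEADER:
            words.extend(w.lower() for w in ln.split()[1:])

    # A caption started with 9420 needs a 942f to be displayed.
    if "9420" in words and "942f" not in words:
        missing.append("end of caption (942f)")
    return missing


sample = ("00:00:03:13 9420 9454 5bcb 45d9 c24f c152 c449 cec7 94f2 "
          "c1ce c420 434f cdd0 d554 4552 20ce 4f49 d345 d35d")
print(missing_scc_parts(sample))
print(missing_scc_parts(HEADER + "\n\n" + sample + "\n\n00:00:04:00 942f\n"))
```

The first call reports both problems for the sample as posted; the second reports none for the corrected input.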

score 0 · Answer 2 · answered Aug 15 '23 at 07:10

To obtain plain text, you need to convert hex codes to letters.

Example: 5bcb 45d9 translates to "[KEY":
5b - "["
cb - "K"
45 - "E"
d9 - "Y"

There are also control codes, such as 9420, which are not text.

To find the translation table for letters, check here: https://github.com/mantas-done/subtitles/blob/v1.0.14/src/Code/Converters/SccConverter.php#L843

A bit higher up in the same file there is a table of control codes, which you can skip when parsing text.
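For the sample in the question, the lookup can be approximated without a full table. The high bit of each SCC byte is an odd-parity bit, so clearing it (`& 0x7F`) turns cb into 0x4b, which is "K". The sketch below rests on several assumptions: `decode_scc_line` is a hypothetical name, the remaining 7-bit value is treated as plain ASCII (a handful of codes map to accented letters in the real table), any pair whose first byte falls in 0x10 to 0x1F is treated as a control code, and a control code whose second byte is 0x40 or above is taken to be a cursor-positioning code and rendered as a line break.

```python
def decode_scc_line(line):
    """Return (timecode, text) for one SCC data line."""
    timecode, *words = line.split()
    text = []
    for word in words:
        # Each word is two bytes; bit 7 of each byte is an odd-parity bit.
        hi = int(word[:2], 16) & 0x7F
        lo = int(word[2:], 16) & 0x7F
        if 0x10 <= hi <= 0x1F:
            # Control pair, not text. Assumption: a second byte of 0x40 or
            # above repositions the cursor, shown here as a line break.
            if lo >= 0x40 and text:
                text.append("\n")
            continue
        for byte in (hi, lo):
            if byte >= 0x20:  # skip 0x00 padding bytes
                text.append(chr(byte))
    return timecode, "".join(text)


sample = ("00:00:03:13 9420 9454 5bcb 45d9 c24f c152 c449 cec7 94f2 "
          "c1ce c420 434f cdd0 d554 4552 20ce 4f49 d345 d35d")
timecode, text = decode_scc_line(sample)
print(timecode)
print(text)
# 00:00:03:13
# [KEYBOARDING
# AND COMPUTER NOISES]
```

For anything beyond simple captions like this one, the linked tables or a library such as pycaption are the safer route, since this sketch ignores special and extended characters.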
