
Given the following string:

00:00:03:13 9420 9454 5bcb 45d9 c24f c152 c449 cec7 94f2 c1ce c420 434f cdd0 d554 4552 20ce 4f49 d345 d35d

How would I convert this to text?

David542
  • The beginning does not look like 2-octets as hexadecimal. – Deduplicator Jan 26 '15 at 23:32
  • Any particular programming language? If not, then the `language-agnostic` tag would probably be appropriate. – Paul R Jan 26 '15 at 23:32
  • @PaulR any language would work here -- my preference would be for python. – David542 Jan 26 '15 at 23:57
  • Perhaps with something like [this](http://stackoverflow.com/a/4296727/355230)? – martineau Jan 27 '15 at 00:02
  • What does SCC stand for? – martineau Jan 27 '15 at 00:06
  • @martineau here's some more info: http://www.theneitherworld.com/mcpoodle/SCC_TOOLS/DOCS/SCC_FORMAT.HTML – David542 Jan 27 '15 at 00:20
  • Well, from what that says, the data is made up of 2-byte commands and single-byte characters. It's possible someone has already implemented a decoder for this, so you should look around for a Python module that handles it. Otherwise you're going to need to use the format specification and write something yourself to do it. From what I read it sounds completely feasible to do in Python, and probably not overly difficult given the detailed format information available. – martineau Jan 27 '15 at 01:59
  • BTW It doesn't sound like the text is unicode, more like iso-8859-1 to me. – martineau Jan 27 '15 at 02:18

2 Answers


pycaption is a library I found that can read SCC. Install it with pip install pycaption, then try to parse your sample:

from pycaption import SCCReader
input = '00:00:03:13 9420 9454 5bcb 45d9 c24f c152 c449 cec7 94f2 c1ce c420 434f cdd0 d554 4552 20ce 4f49 d345 d35d'
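# Python 2: decode the byte string to unicode first. On Python 3, str is already text, so pass input directly.
contents = SCCReader().read(input.decode('utf-8'))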
contents.get_captions('en-US')

This raises an error:

pycaption.exceptions.CaptionReadNoCaptions: CaptionReadNoCaptions((u'empty caption file',))

That's because SCC doesn't contain only encoded text; it also contains commands. The first two bytes, 9420, mean "start pop-on caption". A trailing 942f (End Of Caption) is expected before the text is displayed, but it is missing from your sample. I assume it is in a later part that you didn't paste. In addition, the first line of an SCC file should be the format header. Let's add those two lines to your sample:

input = '''Scenarist_SCC V1.0

00:00:03:13 9420 9454 5bcb 45d9 c24f c152 c449 cec7 94f2 c1ce c420 434f cdd0 d554 4552 20ce 4f49 d345 d35d

00:00:04:00 942f
'''

Then the output would be:

[u'00:00:04.037 --> 00:00:00.000\n[KEYBOARDING\nAND COMPUTER NOISES]']
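
The two conditions described above (a header line and a closing 942f) can be checked before handing the data to a parser. The following is only a minimal sketch: scc_precheck is a hypothetical helper written for this answer, not part of pycaption, and it assumes the file uses pop-on captions as in the sample.

```python
def scc_precheck(text):
    """Return a list of likely reasons a parser would find no captions."""
    problems = []
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]

    # The first non-empty line should be the format header.
    if not lines or not lines[0].startswith("Scenarist_SCC"):
        problems.append("missing Scenarist_SCC header")

    # Collect every 4-digit hex word, skipping the header and the timecodes.
    words = []
    for ln in lines:
        if ln.startswith("Scenarist_SCC"):
            continue
        words.extend(w.lower() for w in ln.split()[1:])

    # 9420 starts loading a pop-on caption; 942f is what displays it.
    if "9420" in words and "942f" not in words:
        problems.append("pop-on caption started (9420) but never ended (942f)")
    return problems
```

Running it on the original one-line sample reports both problems, while the amended sample with the header and the 942f line reports none.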
ZZY

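To obtain plain text, you need to convert the hex codes to letters. Each byte carries a 7-bit character code plus an odd-parity bit in the top position, which is why "K" (ASCII 0x4b) appears as cb.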

Example: 5bcb 45d9 >>translates to>> "[KEY"
5b - "["
cb - "K"
45 - "E"
d9 - "Y"

There are also control codes, such as 9420, that are not text.

To find the translation table for letters, check here: https://github.com/mantas-done/subtitles/blob/v1.0.14/src/Code/Converters/SccConverter.php#L843

A bit higher in the same file there is a table of control codes, which you can skip when parsing text.
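
The approach above can be sketched in a few lines of Python. This is a rough illustration rather than a full decoder: decode_scc_line is a hypothetical name, and the sketch assumes that clearing the top (parity) bit of each byte yields plain ASCII, which holds for the letters in this sample but not for the handful of positions where the caption character set differs from ASCII. It also treats any pair whose first byte falls in 0x10 to 0x1f after masking as a control code and uses it only as a break between text runs.

```python
def decode_scc_line(line):
    """Return the text runs in one SCC line, split at control codes."""
    timecode, *words = line.split()
    runs, current = [], []
    for word in words:
        # Each word is two bytes; clear the parity bit of both.
        hi, lo = (b & 0x7F for b in bytes.fromhex(word))
        if 0x10 <= hi <= 0x1F:
            # Control code pair: not text, so close the current run.
            if current:
                runs.append("".join(current))
                current = []
            continue
        # Bytes below 0x20 (for example 0x00 padding) carry no character.
        current.extend(chr(b) for b in (hi, lo) if b >= 0x20)
    if current:
        runs.append("".join(current))
    return runs


sample = ("00:00:03:13 9420 9454 5bcb 45d9 c24f c152 c449 cec7 94f2 "
          "c1ce c420 434f cdd0 d554 4552 20ce 4f49 d345 d35d")
print(decode_scc_line(sample))
# ['[KEYBOARDING', 'AND COMPUTER NOISES]']
```

For real files, a lookup table like the one linked above is the safer choice, since it also covers the non-ASCII characters and the two-byte special characters.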

Mantas D