
I have the following srt (subtitle) file:

import pysrt

srt = """
01
00:02:14,000 --> 00:02:18,000
I understand how customers do their choice. So

02
00:02:19,000 --> 00:02:24,000
what is the choice of packaging that they prefer when they have to pick up something in a shelf?

03
00:02:24,000 --> 00:02:29,000
What is the choice of the store where they will go shopping? What specific

04
00:02:29,000 --> 00:02:34,000
product they will purchase and also what is the brand that they will

05
00:02:34,000 --> 00:02:39,000
prefer. And of course many of the choices that are relevant in the context of marketing.
"""

As you can see, the subtitles were split weirdly. I would prefer each subtitle to end with a complete sentence, like so:

srt = """
01
00:02:14,000 --> 00:02:18,000
I understand how customers do their choice. 

02
00:02:19,000 --> 00:02:24,000
So what is the choice of packaging that they prefer when they have to pick up something in a shelf?

03
00:02:24,000 --> 00:02:29,000
What is the choice of the store where they will go shopping? 

04
00:02:29,000 --> 00:02:34,000
What specific product they will purchase and also what is the brand that they will prefer. 

05
00:02:34,000 --> 00:02:39,000
And of course many of the choices that are relevant in the context of marketing.
"""

I was wondering how to achieve this using Python. The subtitle text can be opened using pysrt:

import pysrt

srt = """
01
00:02:14,000 --> 00:02:18,000
I understand how customers do their choice. So

02
00:02:19,000 --> 00:02:24,000
what is the choice of packaging that they prefer when they have to pick up something in a shelf?

03
00:02:24,000 --> 00:02:29,000
What is the choice of the store where they will go shopping? What specific

04
00:02:29,000 --> 00:02:34,000
product they will purchase and also what is the brand that they will

05
00:02:34,000 --> 00:02:39,000
prefer. And of course many of the choices that are relevant in the context of marketing."""


with open("test.srt", "w") as text_file:
    text_file.write(srt)

sub = pysrt.open("test.srt")
text = sub.text

**EDIT:**

Based on @Chris's answer, I tried:

import re
from operator import itemgetter

srt = """
    01
    00:02:14,000 --> 00:02:18,000
    understand how customers do their choice. So

    02
    00:02:19,000 --> 00:02:24,000
    what is the choice of packaging that they prefer when they have to pick up something in a shelf?

    03
    00:02:24,000 --> 00:02:29,000
    What is the choice of the store where they will go shopping? What specific

    04
    00:02:29,000 --> 00:02:34,000
    product they will purchase and also what is the brand that they will

    05
    00:02:34,000 --> 00:02:39,000
    prefer. And of course many of the choices that are relevant in the context of marketing.
    """


l = [s.split('\n') for s in srt.strip().split('\n\n')]
whole = ' '.join(map(itemgetter(2), l))
for i, sen in enumerate(re.findall(r'([A-Z][^\.!?]*[\.!?])', whole)):
    l[i][2] = sen
print('\n\n'.join('\n'.join(s) for s in l))

but the result I get is exactly the same as the input:

01
    00:02:14,000 --> 00:02:18,000
    understand how customers do their choice. So

    02
    00:02:19,000 --> 00:02:24,000
    what is the choice of packaging that they prefer when they have to pick up something in a shelf?

    03
    00:02:24,000 --> 00:02:29,000
    What is the choice of the store where they will go shopping? What specific

    04
    00:02:29,000 --> 00:02:34,000
    product they will purchase and also what is the brand that they will

    05
    00:02:34,000 --> 00:02:39,000
    prefer. And of course many of the choices that are relevant in the context of marketing.

What am I doing wrong?

halfer
henry
  • @PaulRooney Good point! I'm not really sure how to do that, though, as I don't know how long it takes to speak a certain number of words. One could, however, find an average by dividing a given time period (e.g. `00:02:34,000 --> 00:02:39,000`) by the number of letters in that time period. – henry May 14 '19 at 07:09

1 Answer

This is a bit messy and can be error-prone, but it works as expected:

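import re
from operator import itemgetter

# `srt` is the unindented string from the question.
# Split it into blocks of [index, timestamp, text].
l = [s.split('\n') for s in srt.strip().split('\n\n')]
# Join all text lines into one string, then re-split it into sentences.
whole = ' '.join(map(itemgetter(2), l))
for i, sen in enumerate(re.findall(r'([A-Z][^\.!?]*[\.!?])', whole)):
    l[i][2] = sen  # assumes exactly one sentence per block
print('\n\n'.join('\n'.join(s) for s in l))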

Output:

01
00:02:14,000 --> 00:02:18,000
I understand how customers do their choice.

02
00:02:19,000 --> 00:02:24,000
So what is the choice of packaging that they prefer when they have to pick up something in a shelf?

03
00:02:24,000 --> 00:02:29,000
What is the choice of the store where they will go shopping?

04
00:02:29,000 --> 00:02:34,000
What specific product they will purchase and also what is the brand that they will prefer.

05
00:02:34,000 --> 00:02:39,000
And of course many of the choices that are relevant in the context of marketing.

Regex part reference: Regex to find all sentences of text?
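To make the regex idea concrete, and to show one way of coping with indented input like the one in the question's EDIT, here is a minimal sketch. The names `SENTENCE`, `resplit` and `indented` are made up for illustration. It assumes one text line per subtitle block and as many sentences as blocks; surplus sentences are silently dropped by `zip`.

```python
import re
from operator import itemgetter

# Same character classes as the pattern above, without the redundant escapes.
SENTENCE = re.compile(r'[A-Z][^.!?]*[.!?]')

# A chunk must start with a capital letter, so "how are you?" is skipped.
print(SENTENCE.findall("Hello there. how are you? Fine!"))
# ['Hello there.', 'Fine!']


def resplit(srt_text):
    """Re-split subtitle text so each block holds one full sentence (sketch)."""
    # Split into blocks on blank (or whitespace-only) lines, then strip every
    # line so that indentation cannot leak into the index, timing or text.
    blocks = [
        [line.strip() for line in block.strip().split('\n')]
        for block in re.split(r'\n\s*\n', srt_text.strip())
    ]
    # Join the text lines (index 2 of each block) and find the sentences.
    whole = ' '.join(map(itemgetter(2), blocks))
    for block, sentence in zip(blocks, SENTENCE.findall(whole)):
        block[2] = sentence
    return '\n\n'.join('\n'.join(block) for block in blocks)


indented = """
    01
    00:02:14,000 --> 00:02:18,000
    I understand how customers do their choice. So

    02
    00:02:19,000 --> 00:02:24,000
    what is the choice of packaging that they prefer?
    """
print(resplit(indented))
```

The first `print` also shows the main limitation of the pattern: text that does not start with a capital letter is not matched at all, so it would vanish from the output.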

Chris
  • +1 Nice answer. Do you see a way of also adjusting the time? (Maybe with the method proposed by @henry in his comment under his question.) –  May 14 '19 at 08:17
  • Hi! Thanks a lot for your answer. I tried to implement it (see EDIT), but it's not working for me. Do you see why? Thanks a lot. – henry May 14 '19 at 08:36
  • @henry I've added `strip`. Can you give it another try? – Chris May 14 '19 at 09:13
  • @Chris Tried it with the new version, but it still doesn't work... Have a look at my updated EDIT. Thanks – henry May 14 '19 at 09:59
  • @henry I think the srt you are using has leading whitespace on each line, unlike the original example you posted. Can you try with the very first srt, where no line is indented? – Chris May 14 '19 at 10:11
  • Yes, you are right. It works now! :)) I am not very familiar with regex. Could you just tell me the main idea behind your code? – henry May 14 '19 at 11:10
  • @henry The `re.findall` part basically looks for a chunk of the string that starts with a capital letter, `[A-Z]`, continues with anything except sentence-ending punctuation, `[^\.!?]*`, and ends with that punctuation, `[\.!?]`. – Chris May 15 '19 at 02:11
  • Nice. Thanks a lot ! – henry May 15 '19 at 13:38
  • I have opened up a new question to deal with the timing problem: https://stackoverflow.com/questions/56156385/reformat-subtitle-text-and-time-to-end-with-complete-sentence – henry May 15 '19 at 19:28
  • @james Please have a look here: https://stackoverflow.com/questions/56156385/reformat-subtitle-text-and-time-to-end-with-complete-sentence – henry May 15 '19 at 19:28
  • @henry Thanks. Will check it out. :) –  May 15 '19 at 19:29
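
For the timing question raised in the comments, one rough possibility is a variation on the letter-count heuristic henry mentions: spread a known time span over the re-split sentences in proportion to their character counts. The sketch below is not taken from the linked follow-up question. The helpers `to_ms`, `to_stamp` and `retime` are hypothetical names, and it assumes continuous speech at a constant rate per character, ignoring any pauses between the original subtitles.

```python
import re


def to_ms(stamp):
    """Convert an SRT timestamp such as '00:02:14,000' to milliseconds."""
    h, m, s, ms = map(int, re.split(r'[:,]', stamp))
    return ((h * 60 + m) * 60 + s) * 1000 + ms


def to_stamp(ms):
    """Convert milliseconds back to the 'HH:MM:SS,mmm' SRT format."""
    s, ms = divmod(ms, 1000)
    m, s = divmod(s, 60)
    h, m = divmod(m, 60)
    return f'{h:02d}:{m:02d}:{s:02d},{ms:03d}'


def retime(sentences, start, end):
    """Give each sentence a share of [start, end] proportional to its length."""
    total_chars = sum(len(s) for s in sentences)
    first, span = to_ms(start), to_ms(end) - to_ms(start)
    cues, cursor, seen = [], first, 0
    for sentence in sentences:
        seen += len(sentence)
        # Cumulative character share decides where this sentence stops.
        stop = first + span * seen // total_chars
        cues.append((to_stamp(cursor), to_stamp(stop), sentence))
        cursor = stop
    return cues


for cue_start, cue_end, text in retime(
        ["Hi.", "Hello there."], "00:00:00,000", "00:00:05,000"):
    print(f'{cue_start} --> {cue_end}\n{text}\n')
```

With 3 and 12 characters over a 5-second span, the first sentence gets 1 second and the second gets the remaining 4. Real speech is not that regular, so the results would only be an approximation.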