The following breaks the problem into a few pieces:
- Parsing the input file. `parse_subtitles` is a generator that takes a source of lines and yields a sequence of records of the form `{'index': 'N', 'timestamp': 'NN:NN:NN,NNN --> NN:NN:NN,NNN', 'subtitles': ['TEXT', ...]}`. The approach I took was to track which of three distinct states we're in: `seeking to next entry`, for when we're looking for the next index number, which should match the regular expression `^\d*$` (nothing but a bunch of digits); `looking for timestamp`, when an index has been found and we expect a timestamp on the next line, which should match the regular expression `^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$` (HH:MM:SS,mmm --> HH:MM:SS,mmm); and `reading subtitles`, while consuming actual subtitle text, with blank lines and EOF interpreted as subtitle termination points.
- Writing the above records to rows in a worksheet. `write_dict_to_worksheet` accepts a row and worksheet, plus a record and a dictionary defining the Excel 0-indexed column numbers for each of the record's keys, and then writes the data appropriately. It returns the next free row, since a record with several subtitle lines spans several rows.
- Organizing the overall conversion. `convert` accepts an input filename (e.g. `'Wildlife.srt'`) that'll be opened and passed to the `parse_subtitles` function, and an output filename (e.g. `'Subtitle.xlsx'`) that will be created using `xlsxwriter`. It then writes a header and, for each record parsed from the input file, writes that record to the XLSX file.
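To make the two patterns concrete, here is a small sketch that labels a few made-up sample lines. The `classify` helper is hypothetical and only for illustration; unlike the parser it ignores state, so an all-digit subtitle line (say, a year) would be labelled as an index here, which is exactly why the parser tracks which state it is in. Note also that `^\d*$` on its own matches an empty string.

```python
import re

# Same patterns as described above; the sample lines are made up.
INDEX = re.compile(r'^\d*$')
TIMESTAMP = re.compile(r'^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$')


def classify(line):
    """Hypothetical helper: label one line without any state tracking."""
    if line.strip() == '':
        return 'blank'
    if INDEX.match(line):
        return 'index'
    if TIMESTAMP.match(line):
        return 'timestamp'
    return 'text'


samples = [
    '1',
    '00:00:01,000 --> 00:00:04,500',
    '00:00;01,000 --> 00:00:04,500',  # ';' typed instead of ':'
    'Hello there.',
    '',
]
for sample in samples:
    print(repr(sample), '->', classify(sample))
```

The mistyped timestamp falls through to `text`, which is the situation the parser reports as an error when it was expecting a timestamp.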
I left the logging statements in for self-commenting purposes, and because when reproducing your input file I fat-fingered a `:` to a `;` in a timestamp, making it unrecognized, and having the error pop up was handy for debugging!
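One caveat: by default Python's `logging` module only shows messages at WARNING level and above, so the error messages appear but the debug and info ones stay hidden. If you want to see everything, a minimal setup along these lines should do it (the format string and the sample messages are just my choice for illustration):

```python
import logging

# Show DEBUG and above on stderr. force=True (Python 3.8+) replaces any
# handlers that were configured earlier in the process.
logging.basicConfig(
    level=logging.DEBUG,
    format='%(levelname)s: %(message)s',
    force=True,
)

# Sample messages in the same style as the parser's logging calls.
logging.debug('Found index: {i}'.format(i='1'))
logging.error('HUH: Expected to find a timestamp, but instead found: [{d}]'.format(
    d='00:00;01,000 --> 00:00:04,500'))
```

Put the `basicConfig` call before calling `convert`; switch the level to `logging.ERROR` once the file parses cleanly.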
I've put a text version of your source file, along with the below code, in this Gist

import xlsxwriter
import re
import logging


def parse_subtitles(lines):
    """Yield one record per subtitle entry found in an iterable of lines."""
    # Note: \d* also matches an empty line, so blank lines are skipped
    # before the index pattern is tried (see 'seeking to next entry').
    line_index = re.compile(r'^\d*$')
    line_timestamp = re.compile(r'^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$')
    line_separator = re.compile(r'^\s*$')

    current_record = {'index': None, 'timestamp': None, 'subtitles': []}
    state = 'seeking to next entry'

    for line in lines:
        line = line.strip('\n')
        if state == 'seeking to next entry':
            if line_separator.match(line):
                # Tolerate extra blank lines between entries
                continue
            if line_index.match(line):
                logging.debug('Found index: {i}'.format(i=line))
                current_record['index'] = line
                state = 'looking for timestamp'
            else:
                logging.error('HUH: Expected to find an index, but instead found: [{d}]'.format(d=line))

        elif state == 'looking for timestamp':
            if line_timestamp.match(line):
                logging.debug('Found timestamp: {t}'.format(t=line))
                current_record['timestamp'] = line
                state = 'reading subtitles'
            else:
                logging.error('HUH: Expected to find a timestamp, but instead found: [{d}]'.format(d=line))

        elif state == 'reading subtitles':
            if line_separator.match(line):
                logging.info('Blank line reached, yielding record: {r}'.format(r=current_record))
                yield current_record
                state = 'seeking to next entry'
                current_record = {'index': None, 'timestamp': None, 'subtitles': []}
            else:
                logging.debug('Appending to subtitle: {s}'.format(s=line))
                current_record['subtitles'].append(line)

        else:
            logging.error('HUH: Fell into an unknown state: `{s}`'.format(s=state))

    if state == 'reading subtitles':
        # We must have finished the file without encountering a blank line. Dump the last record
        yield current_record


def write_dict_to_worksheet(columns_for_keys, keyed_data, worksheet, row):
    """
    Write a subtitle-record to a worksheet.
    Return the row number after those that were written (since this may write multiple rows)
    """
    current_row = row

    # First, horizontally write the entry and timecode
    for (colname, colindex) in columns_for_keys.items():
        if colname != 'subtitles':
            worksheet.write(current_row, colindex, keyed_data[colname])

    # Next, vertically write the subtitle data
    subtitle_column = columns_for_keys['subtitles']
    for morelines in keyed_data['subtitles']:
        worksheet.write(current_row, subtitle_column, morelines)
        current_row += 1

    # Always advance at least one row, so a record with no subtitle text
    # is not overwritten by the next record
    return max(current_row, row + 1)


def convert(input_filename, output_filename):
    workbook = xlsxwriter.Workbook(output_filename)
    worksheet = workbook.add_worksheet('subtitles')
    columns = {'index': 0, 'timestamp': 1, 'subtitles': 2}

    next_available_row = 0
    records_processed = 0
    headings = {'index': "Entries", 'timestamp': "Timecodes", 'subtitles': ["Subtitles"]}
    next_available_row = write_dict_to_worksheet(columns, headings, worksheet, next_available_row)

    with open(input_filename) as textfile:
        for record in parse_subtitles(textfile):
            next_available_row = write_dict_to_worksheet(columns, record, worksheet, next_available_row)
            records_processed += 1

    print('Done converting {inp} to {outp}. {n} subtitle entries found. {m} rows written'.format(
        inp=input_filename, outp=output_filename, n=records_processed, m=next_available_row))
    workbook.close()


if __name__ == '__main__':
    convert(input_filename='Wildlife.srt', output_filename='Subtitle.xlsx')

Edit: Updated to split multiline subtitles across multiple rows in output
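
To illustrate the multi-row layout that edit refers to, here is a rough sketch that computes which cells a record would occupy, without needing `xlsxwriter`. The `cells_for_record` helper and the sample record are hypothetical and only mirror the row logic for a typical record: index and timestamp go on the first row, and each subtitle line gets its own row in the subtitles column.

```python
def cells_for_record(columns_for_keys, record, row):
    """Hypothetical helper: return ([(row, col, value), ...], next_free_row)."""
    cells = []
    # Index and timestamp are written once, on the record's first row.
    for key, col in columns_for_keys.items():
        if key != 'subtitles':
            cells.append((row, col, record[key]))
    # Each subtitle line goes on its own row, in the subtitles column.
    current_row = row
    for text in record['subtitles']:
        cells.append((current_row, columns_for_keys['subtitles'], text))
        current_row += 1
    return cells, current_row


columns = {'index': 0, 'timestamp': 1, 'subtitles': 2}
record = {
    'index': '1',
    'timestamp': '00:00:01,000 --> 00:00:04,500',
    'subtitles': ['First line', 'Second line'],
}
cells, next_row = cells_for_record(columns, record, row=1)
for cell in cells:
    print(cell)
print('next free row:', next_row)
```

With the header on row 0, this two-line entry fills rows 1 and 2, and the next entry starts on row 3.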