1

this time I tried to use Python's xlsxwriter module to write data from a .srt into an excel.

The subtitle file looks like this in sublime text:

but I want to write the data into an excel, so it looks like this:

It's my first time to code python for this, so I'm still in the stage of trial and error...I tried to write some code like below

but I don't think it makes sense...

I'll continue trying out, but if you know how to do it, please let me know. I'll read your code and try to understand them! Thank you! :)

Penny
  • 1,218
  • 1
  • 13
  • 32
  • 2
    Is the intention really for multiline subtitle text to end up in different rows in excel? What should happen if the subtitle text occupies more than 3 lines? – lemonhead Aug 30 '15 at 04:41
  • Also, it would be awesome if you could update your question with the pasted in text itself as code blocks from sublime (highlight and Ctrl+K to format text as code block). Would be helpful for reproducing / cutting and pasting what you have – lemonhead Aug 30 '15 at 04:43

1 Answers1

3

The following breaks the problem into a few pieces:

  • Parsing the input file. parse_subtitles is a generator that takes a source of lines and yields up a sequence of records in the form {'index':'N', 'timestamp':'NN:NN:NN,NNN -> NN:NN:NN,NNN', 'subtitle':'TEXT'}'. The approach I took was to track which of three distinct states we're in:
    1. seeking to next entry for when we're looking for the next index number, which should match the regular expression ^\d*$ (nothing but a bunch of numbers)
    2. looking for timestamp when an index is found and we expect a timestamp to come in the next line, which should match the regular expression ^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$ (HH:MM:SS,mmm -> HH:MM:SS,mmm) and
    3. reading subtitles while consuming actual subtitle text, with blank lines and EOF interpreted as subtitle termination points.
  • Writing the above records to a row in a worksheet. write_dict_to_worksheet accepts a row and worksheet, plus a record and a dictionary defining the Excel 0-indexed column numbers for each of the record's keys, and then it writes the data appropriately.
  • Organizaing the overall conversion convert accepts an input filename (e.g. 'Wildlife.srt' that'll be opened and passed to the parse_subtitles function, and an output filename (e.g. 'Subtitle.xlsx' that will be created using xlsxwriter. It then writes a header and, for each record parsed from the input file, writes that record to the XLSX file.

Logging statements left in for self-commenting purposes, and because when reproducing your input file I fat-fingered a : to a ; in a timestamp, making it unrecognized, and having the error pop up was handy for debugging!

I've put a text version of your source file, along with the below code, in this Gist

import xlsxwriter
import re
import logging

def parse_subtitles(lines):
    line_index = re.compile('^\d*$')
    line_timestamp = re.compile('^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$')
    line_seperator = re.compile('^\s*$')

    current_record = {'index':None, 'timestamp':None, 'subtitles':[]}
    state = 'seeking to next entry'

    for line in lines:
        line = line.strip('\n')
        if state == 'seeking to next entry':
            if line_index.match(line):
                logging.debug('Found index: {i}'.format(i=line))
                current_record['index'] = line
                state = 'looking for timestamp'
            else:
                logging.error('HUH: Expected to find an index, but instead found: [{d}]'.format(d=line))

        elif state == 'looking for timestamp':
            if line_timestamp.match(line):
                logging.debug('Found timestamp: {t}'.format(t=line))
                current_record['timestamp'] = line
                state = 'reading subtitles'
            else:
                logging.error('HUH: Expected to find a timestamp, but instead found: [{d}]'.format(d=line))

        elif state == 'reading subtitles':
            if line_seperator.match(line):
                logging.info('Blank line reached, yielding record: {r}'.format(r=current_record))
                yield current_record
                state = 'seeking to next entry'
                current_record = {'index':None, 'timestamp':None, 'subtitles':[]}
            else:
                logging.debug('Appending to subtitle: {s}'.format(s=line))
                current_record['subtitles'].append(line)

        else:
            logging.error('HUH: Fell into an unknown state: `{s}`'.format(s=state))
    if state == 'reading subtitles':
        # We must have finished the file without encountering a blank line. Dump the last record
        yield current_record

def write_dict_to_worksheet(columns_for_keys, keyed_data, worksheet, row):
    """
    Write a subtitle-record to a worksheet. 
    Return the row number after those that were written (since this may write multiple rows)
    """
    current_row = row
    #First, horizontally write the entry and timecode
    for (colname, colindex) in columns_for_keys.items():
        if colname != 'subtitles': 
            worksheet.write(current_row, colindex, keyed_data[colname])

    #Next, vertically write the subtitle data
    subtitle_column = columns_for_keys['subtitles']
    for morelines in keyed_data['subtitles']:
        worksheet.write(current_row, subtitle_column, morelines)
        current_row+=1

    return current_row

def convert(input_filename, output_filename):
    workbook = xlsxwriter.Workbook(output_filename)
    worksheet = workbook.add_worksheet('subtitles')
    columns = {'index':0, 'timestamp':1, 'subtitles':2}

    next_available_row = 0
    records_processed = 0
    headings = {'index':"Entries", 'timestamp':"Timecodes", 'subtitles':["Subtitles"]}
    next_available_row=write_dict_to_worksheet(columns, headings, worksheet, next_available_row)

    with open(input_filename) as textfile:
        for record in parse_subtitles(textfile):
            next_available_row = write_dict_to_worksheet(columns, record, worksheet, next_available_row)
            records_processed += 1

    print('Done converting {inp} to {outp}. {n} subtitle entries found. {m} rows written'.format(inp=input_filename, outp=output_filename, n=records_processed, m=next_available_row))
    workbook.close()

convert(input_filename='Wildlife.srt', output_filename='Subtitle.xlsx')

Edit: Updated to split multiline subtitles across multiple rows in output

Bryant
  • 622
  • 4
  • 18
  • Hi Bryant, thank you very much for writing the code! I tried it and it seems like they all generate errors. So only the headers are there in the Subtitle.xlsx. Also, your guess is right, I have to separate the multiline subtitles in stead of merging them into one column. can I also ask what is the "line" and "lines" refer to in your code "for line in lines"? Thank you! :) – Penny Sep 02 '15 at 05:27
  • @Penny re:"for line in lines": The `parse_subtitles` function expects its parameter `lines` to be provided with an iterable (a list, dictionary, generator, etc) that supplies lines of text. The `convert` function supplies an opened file as the argument to the `parse_subtitles` function, and [Python lets files be used as an iterable source of lines-in-the-file](http://stackoverflow.com/a/10255293/5265467). // The `line` variable holds the current element of `lines` that the [for loop is processing](https://docs.python.org/3.4/tutorial/controlflow.html#for-statements). – Bryant Sep 02 '15 at 14:39
  • @Penny re: multiline separation: Is the output goal to have each timestamp appear only once (e.g. one spreadsheet row with index/timestamp/text_line_1, and then the next spreadsheet row with blank/blank/test_line_2) or to have each line of text bear complete index/timestamp information, possibly with an extra bit of order-of-lines-in-this-block information? (e.g. one row with index/timestamp/text_line_1/1, the next row with index/timestamp/text_line_2/2) – Bryant Sep 02 '15 at 14:42
  • @Penny re:errors: I've successfully run the code on Python 3.4.1 (Anaconda 2.1.0 64 bit distribution) on Windows 7. Without knowing what errors you're running into and what's different about your setup than mine, I can't provide any specific recommendations. – Bryant Sep 02 '15 at 15:01