
I'm trying to write a script that converts the plain-text transcript of a video from something like

1 00:00:00,000 --> 00:00:03,550 text1

2 00:00:03,550 --> 00:00:07,030 text2

to "text1 text2". I have more than 100 separate files and want to combine all of them into one.

So I wrote something like:

import re
import os

path = r'the full path of the directory'
f = open("video_script.txt", 'w')

for filename in os.listdir(path):
    text = open(filename).read()

    textblock = reduce(lambda x,y: x+y+' ', re.findall('([a-zA-z].*)\r', text))
    newtext = textblock.replace('. ', '.\n')

    f.write ('*'+filename+'*')
    f.write ('\n') 
    f.write(newtext)
    f.write('\n'*2)

f.close()

The code ran successfully for about 30 files, then failed with this error:

TypeError: reduce() of empty sequence with no initial value 

I ran a separate test on the file that failed and got no error. Thanks for any help.

Esther
  • "for about 30 files": Is it the same set of files every time? – Jeroen Heier Jul 28 '18 at 05:33
  • The problem is that the regex is *not* matching anything in the file. This is probably because `\r` will never match a carriage return: by default Python opens files in "universal newline mode", which means that all `\r\n`, `\r` and `\n` are mapped to just `\n`. If you want to match a carriage return you need to open the file in binary mode, `open(filename, 'rb')`, and use a binary regex (e.g. `b'([a-zA-z].*)\r'`, note the prefix `b` on the byte string). – Bakuriu Jul 28 '18 at 05:52
  • FYI: [`[a-zA-z]` does not only match letters](https://stackoverflow.com/questions/29771901/why-is-this-regex-allowing-a-caret#29771926). – Wiktor Stribiżew Jul 28 '18 at 11:15
  • @Jeroen Heier: Yes, it stuck at the same every time so I ran a separate test on that file which worked. – Esther Jul 28 '18 at 14:32
  • @Bakuriu: After I changed the `\r` to `\n`, the code ran without an error. But then for `re.findall('([a-zA-z].*)\r', text)` I would get a list of strings ending with `\r`, like ["It used to be that you wouldn't\r", 'even think about\r']. How could I fix that? – Esther Jul 28 '18 at 14:38
  • @WiktorStribiżew: Thanks for the reminder. I did not know it would match any character but it did filter out things like "00:00:00,000 --> 00:00:03,550". I don't think I understand... – Esther Jul 28 '18 at 14:41
  • Try `re.findall('[a-zA-Z][^\r\n]+', text)` if the only problem is CR. – Wiktor Stribiżew Jul 28 '18 at 16:11
  • @WiktorStribiżew: Thank you so much. It works. – Esther Jul 28 '18 at 17:09

1 Answer


You seem to want to match any chars other than CR and LF after an ASCII letter. Since . also matches CR, .* pulls the trailing \r into each match, so it does not help in this case. You may use

re.findall('[a-zA-Z][^\r\n]+', text)

Details

  • [a-zA-Z] - an ASCII letter (to match any Unicode letter, use [^\W\d_])
  • [^\r\n]+ - one or more (+) chars other than CR and LF ([^...] is a negated character class matching any char other than the char set(s)/range(s) defined in the character class).
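As a rough illustration of the pattern above, here is a minimal sketch applied to a made-up sample with CRLF line endings. The names sample and extract_text are hypothetical and not part of the question's code. It also uses ' '.join instead of reduce, on the assumption that a file with no matching lines should produce an empty string rather than raise the TypeError from the question.

```python
import re

# Hypothetical sample: subtitle text whose lines end in CR LF.
sample = (
    "1\r\n00:00:00,000 --> 00:00:03,550\r\ntext1\r\n\r\n"
    "2\r\n00:00:03,550 --> 00:00:07,030\r\ntext2\r\n"
)

def extract_text(text):
    """Join every run that starts with an ASCII letter and stops before CR or LF."""
    lines = re.findall(r'[a-zA-Z][^\r\n]+', text)
    # str.join returns '' for an empty list, whereas reduce() without an
    # initial value raises TypeError on an empty sequence.
    return ' '.join(lines)

print(extract_text(sample))
```

Note that the + quantifier needs at least one character after the first letter, so a line consisting of a single letter would be skipped; [^\r\n]* would keep it.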
Wiktor Stribiżew