I am using a Python script with the re module to process two files and create the final output shown below, but I am getting an error.

cat links.txt

https://videos-a.jwpsrv.com/content/conversions/7kHOkkQa/videos/XXXXJD8C-32313922.mp4.m3u8?hdnts=exp=1596554537~acl=*/bGxpJD8C-32313922.mp4.m3u8~hmac=2ac95222f1693d11e7fd8758eb0a18d6d2ee187bb10e3c27311e627785687bd5
https://videos-a.jwpsrv.com/content/conversions/7kHOkkQa/videos/XXXXkxI1-32313922.mp4.m3u8?hdnts=exp=1596554733~acl=*/bM07kxI1-32313922.mp4.m3u8~hmac=dd0fc6f433a8ac74c9eaa2a376fa4324a65ae7c410cdcf8e869c6961f1a5b5ea
https://videos-a.jwpsrv.com/content/conversions/7kHOkkQa/videos/XXXXpGKZ-32313922.mp4.m3u8?hdnts=exp=1596554748~acl=*/onhIpGKZ-32313922.mp4.m3u8~hmac=d4030cf7813cef02a58ca17127a0bc6b19dc93cccd6add4edc72a2ee5154f236
https://videos-a.jwpsrv.com/content/conversions/7kHOkkQa/videos/XXXXLbgy-32313922.mp4.m3u8?hdnts=exp=1596554871~acl=*/xGXCLbgy-32313922.mp4.m3u8~hmac=7c515306c033c88d32072d54ba1d6aa4abf1be23070d1bb14d1311e4e74cc1d7

cat name.txt

Introduction Lecture 1
Questions Lecture 1B
Theory Lecture 2
Labour Costing Lecture 352 (Classroom Lecture)

Expected output (final.txt)

https://cdn.jwplayer.com/videos/XXXXJD8C-32313922.mp4
  out=Lecture 001- Introduction.mp4
https://cdn.jwplayer.com/videos/XXXXkxI1-32313922.mp4
  out=Lecture 001B- Questions.mp4
https://cdn.jwplayer.com/videos/XXXXpGKZ-32313922.mp4
  out=Lecture 002- Theory.mp4
https://cdn.jwplayer.com/videos/XXXXLbgy-32313922.mp4
  out=Lecture 352- Labour Costing (Classroom Lecture).mp4

cat sort.py ( my existing script )

import re

final = open('final.txt','w')
a = open('links.txt','r')
b = open('name.txt','r')
base = 'https://cdn.jwplayer.com/videos/'
kek = re.compile(r'(?<=\/)[\w\-\.]+(?=.m3u8)')
# find max lecture number
n = None
for line in b:
    b_n = int(''.join([c for c in line.rpartition(' ')[2] if c in '1234567890']))
    if n is None or b_n > n:
        n = b_n
n = len(str(n))  # string len of the max lecture number
    
b = open('name.txt','r')
for line in a:
    final.write(base + kek.search(line).group() + '\n')
    b_line = b.readline().rstrip()
    line_before_lecture, _, lecture = b_line.partition('Lecture')
    line_before_lecture = line_before_lecture.strip()
    lecture_no = lecture.rpartition(' ')[2]
    lecture_str = lecture_no.rjust(n, '0') + '-' + " " + line_before_lecture
    final.write('  out=' + 'Lecture ' + lecture_str + '.mp4\n')

Traceback

Traceback (most recent call last):
  File "sort.py", line 11, in <module>
    b_n = int(''.join([c for c in line.rpartition(' ')[2] if c in '1234567890']))
ValueError: invalid literal for int() with base 10: ''

Edit: It seems the error is caused by the last line in name.txt, because my script assumes every line in name.txt ends in the format Lecture X.

One way to fix it, I guess, is to edit the script and add an if condition as follows:

If a line in name.txt doesn't end in the format Lecture X, move the text that follows Lecture X to before the word Lecture.

For example, the 4th line of name.txt, Labour Costing Lecture 352 (Classroom Lecture), could be converted to Labour Costing (Classroom Lecture) Lecture 352. The line below in my script would then need to be edited to match only the last occurrence of "Lecture" in each line of name.txt:

line_before_lecture, _, lecture = b_line.partition('Lecture')

I basically need the expected output (final.txt) from those two files (name.txt and links.txt) using the script. If there's a better or smarter way to do it, I would be happy to use it. The approach above is only a theoretical suggestion, and I don't know how to implement it myself.

Sachin
  • it's saying you don't have a valid int. It's not in 0123456789, so it's probably an empty string trying to be cast as an int. You could always put an extra if to assign it to something (0?) if the string has 0 length. – user3452643 Aug 12 '20 at 17:07
  • It's not clear what code you want written; sounds like you should create a new question with a detailed example of your current code, the current output, and the expected output. But Stack Overflow is not a "please write code for me" service; you will need to show us your best effort. (Hint: `split('Lecture X')`) – tripleee Aug 13 '20 at 08:04
  • @tripleee my 1) current code 2) current error 3) 2 files 4) traceback and 5) expected output all are there in question already. What help do i need with edit in existing script , that also i have explained at bottom of question .Sorry but I don't know how can i explain or put it better , even if i created a new question, it would be exact same as this – Sachin Aug 13 '20 at 08:11
  • *"shift the text succeeding Lecture X prior to word Lecture*" isn't very clear; you really need to include an example. – tripleee Aug 13 '20 at 08:17
  • @tripleee I have edited the question. Also, that's just one possible way to fix it that I could think of. I am sure many people know better and smarter ways to go about it. I just suggested what I felt could possibly fix the issue; it's not necessary to do it that way, to be honest – Sachin Aug 13 '20 at 08:45

2 Answers

1

If you are using regular expressions anyway, why not use them to pull out this information, too?

import re

base = 'https://cdn.jwplayer.com/videos/'
# File name after a '/' and before the literal '.m3u8' suffix
kek = re.compile(r'(?<=\/)[\w\-\.]+(?=\.m3u8)')
# Groups: title, lecture number, anything after the number
nre = re.compile(r'(.*)\s+Lecture (\d+)(.*)')

with open('name.txt') as b:
  lecture = []
  for line in b:
    parsed = nre.match(line)
    if parsed:
      # Stored as (number, suffix, title)
      lecture.append((int(parsed.group(2)), parsed.group(3), parsed.group(1)))
    else:
      raise ValueError('Unable to parse %r' % line)

# Pad width: digits in the largest lecture number, wherever it appears
n = len(str(max(number for number, _, _ in lecture)))

with open('links.txt','r') as a:
  for idx, line in enumerate(a):
    print(base + kek.search(line).group())
    fmt='  out=Lecture {0:0' + str(n) + 'n}{1}- {2}.mp4'
    print(fmt.format(*lecture[idx]))

This traverses the contents of name.txt only once, and stores the results in a list lecture, where each item is a tuple of the pieces we pulled out (number, suffix, title).

I also changed this to write to standard output; redirect to a file if you like, or switch back to explicitly hard-coding the output file in the script itself.
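
If you prefer to keep the output file in the script, one option is to pass a file object to print. This is only a minimal sketch: write_pairs and the sample pair are hypothetical names and data for illustration, and an in-memory buffer stands in for final.txt.

```python
import io

def write_pairs(pairs, out):
    """Write each (url, out_name) pair in the two-line layout of final.txt."""
    for url, out_name in pairs:
        print(url, file=out)
        print('  out=' + out_name, file=out)

# Hypothetical sample data in the shape produced by the loop above.
pairs = [
    ('https://cdn.jwplayer.com/videos/XXXXJD8C-32313922.mp4',
     'Lecture 001- Introduction.mp4'),
]

# An in-memory buffer is used here; with a real file this would be
#   with open('final.txt', 'w') as final: write_pairs(pairs, final)
buf = io.StringIO()
write_pairs(pairs, buf)
print(buf.getvalue(), end='')
```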

The splat syntax *lecture[idx] is just a shorthand to avoid having to write lecture[idx][0], lecture[idx][1], lecture[idx][2] explicitly.
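
To see the format string and the splat in isolation, here is a small sketch. The sample tuple and the pad width of 3 are assumptions chosen to match the example files; the nested-field variant is an alternative spelling, not something the code above uses.

```python
# Hypothetical sample tuple in the same (number, suffix, title) order.
entry = (1, 'B', 'Questions')
n = 3  # assumed pad width, e.g. len(str(352))

fmt = '  out=Lecture {0:0' + str(n) + 'n}{1}- {2}.mp4'

# Unpacking the tuple and passing its items one by one are equivalent.
splat = fmt.format(*entry)
explicit = fmt.format(entry[0], entry[1], entry[2])

# Same padding using a nested replacement field instead of concatenation.
nested = '  out=Lecture {0:0{width}d}{1}- {2}.mp4'.format(*entry, width=n)

print(splat)
```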

Demo: https://repl.it/repls/TatteredInexperiencedFibonacci#main.py

tripleee
  • Thank you so much for the regex answer .The expected output had a very small typo in question , i have edited it .. i apologize for not noticing it earlier. Can you please reflect the same in your code the last line basically ..out=Lecture 352 (Classroom Lecture)- Labour Costing.mp4 to out=Lecture 352- Labour Costing (Classroom Lecture).mp4 – Sachin Aug 13 '20 at 10:16
  • I can suggest how to make further modifications but really, Stack Overflow is for learning stuff, not for getting your job done for free. If you managed to come up with that URL regex, surely you can piece together a small modification to this code (and your requirements are unclear anyway - do you mean allow an optional alphabetic next to the lecture number but move anything else to the title?) – tripleee Aug 13 '20 at 10:42
  • i tried editing it myself but i am not really well versed with regex yet . Yes, if there's anything after Lecture X in name.txt , just simply merge it in the main title itself .. out=Lecture No- Main Title + Additional title ie part after Lecture X ( if any).mp4 – Sachin Aug 13 '20 at 10:48
  • Try `(.*)\s+Lecture (\d+)(\w*)(.*)` and adapt the code to extract four groups instead of three. – tripleee Aug 13 '20 at 11:45
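
One possible way to apply the four-group pattern from the last comment is sketched below. The helper name out_line and the fixed width of 3 are assumptions for illustration; the sample names are the lines of name.txt from the question.

```python
import re

# Pattern from the comment: title, number, optional letter suffix, trailing text.
nre = re.compile(r'(.*)\s+Lecture (\d+)(\w*)(.*)')

def out_line(name, width):
    """Build the '  out=...' line for one line of name.txt (hypothetical helper)."""
    parsed = nre.match(name.strip())
    if not parsed:
        raise ValueError('Unable to parse %r' % name)
    title, number, suffix, extra = parsed.groups()
    # Trailing text such as ' (Classroom Lecture)' is appended to the title.
    return '  out=Lecture {0:0{w}d}{1}- {2}{3}.mp4'.format(
        int(number), suffix, title, extra, w=width)

names = [
    'Introduction Lecture 1',
    'Questions Lecture 1B',
    'Theory Lecture 2',
    'Labour Costing Lecture 352 (Classroom Lecture)',
]
for name in names:
    print(out_line(name, 3))
```

With this split, a letter directly after the number stays with the number, while anything else after it is merged into the title.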
0

The issue is with the last line of name.txt.

>>> line = "Labour Costing Lecture 352 (Classroom Lecture)"
>>> [c for c in line.rpartition(' ')[2]]
['L', 'e', 'c', 't', 'u', 'r', 'e', ')']

Clearly not what you intended to extract. Since none of these characters is a digit, the join produces an empty string, which cannot be converted to an int. If you are looking to extract the int, I would suggest looking at this question: How to extract numbers from a string in Python?
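
As a rough sketch of one alternative, the number can be searched for right after the word Lecture instead of being taken from the last word on the line. The function name lecture_number is hypothetical, and returning None for lines without a number is an assumption about the desired behaviour.

```python
import re

def lecture_number(line):
    """Return the integer following 'Lecture ', or None if there is none."""
    found = re.search(r'Lecture (\d+)', line)
    return int(found.group(1)) if found else None

line = "Labour Costing Lecture 352 (Classroom Lecture)"
print(lecture_number(line))  # 352
```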

backcab
  • More specifically, `line.rpartition(' ')[2]` assumes that the number he wants is always the last word on the line. – Barmar Aug 12 '20 at 17:43