I am trying to extract every string that appears between two keywords: "Subscribe to" and "unsubscribeAccessibility". The infile is a page's HTML source, copied and pasted into a text file.
My program is partially working but it is returning too much information. It is also not writing each string to a new line.
Here is my code so far:
# Reads from a text file copy of HTML source code and finds the word/s between two keywords.
# The word/s are all YouTube subscription channel names.
# Writes the subscriptions found to an outfile called Subs_List.txt
import re  # import regular expression

with open("Subs_HTML_Code.txt") as in_file, open("Subs_List.txt", 'w') as out_file:  # read from infile and write to outfile
    for line in in_file:  # read each line in the infile, Subs_HTML_Code.txt
        try:  # try this for each line read from infile
            subscriptionFound = re.search("Subscribe to(.+)unsubscribeAccessibility", line).group()  # search for all text between two keywords
            out_file.write(subscriptionFound+"\n")  # if a match is made write it to outfile on new line
            print(subscriptionFound)  # print to console each subscription found
        except:  # no matches found
            out_file.write("No subscription found!\n")
            print("No subscription found!")
Here is what the infile, Subs_HTML_Code.txt, looks like:
Subscribe to Fred's Auto Repair."}},"unsubscribeAccessibility":Subscribe to Fred's Auto Repair."}},"unsubscribeAccessibility":{"accessibilityData":{"label":"Unsubscribe from
unbscribeText":{"simpleText":"31.4K"},"subscribedButtonText":{"runs":[{"text":"Subscribed"}]},"unsubscribedButtonText":{"runs":Subscribe to Fred's Auto Repair
."}},"unsubscribeAccessibility":{"accessibilityData":{"label":"Unsubscribe from [{"text":"Unsubscribe"}]},"longSubscriberCountText":{"runs":[{"text":"31.4K
Unsubscribe from [{"text":"Unsubscribe"}]},"longSubscriberCountText":{"runs":[{"text":"31.4KSubscribe to Fred's Auto Repair."}},"unsubscribeAccessibility":
{"accessibilityData":{"label":"Unsubscribe from [{"text":"Unsubscribe"}]},"longSubscriberCountText":{"runs":[{"text":"31.4K
And here is what my program writes to the outfile, Subs_List.txt:
Subscribe to Fred's Auto Repair."}},"unsubscribeAccessibility":Subscribe to Fred's Auto Repair."}},"unsubscribeAccessibility":{"accessibilityData":{"label":"Unsubscribe from unbscribeText":{"simpleText":"31.4K"},"subscribedButtonText":{"runs":[{"text":"Subscribed"}]},"unsubscribedButtonText":{"runs":Subscribe to Fred's Auto Repair ."}},"unsubscribeAccessibility
No subscription found!
Subscribe to Fred's Auto Repair."}},"unsubscribeAccessibility
I don't know why I'm getting so much text after Fred's Auto Repair. Output like this would be okay: Fred's Auto Repair."}}," But I am also getting one of the keywords in there.
Also, if more than one instance of the string appears on the same line, my program writes them all to the same line. Why?
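For reference, here is a minimal sketch of what seems to be going on, using a sample line modeled on the first line of the infile. As far as I can tell, (.+) is greedy, so it runs to the last "unsubscribeAccessibility" on the line, and .group() with no argument returns the whole match, keywords included. The non-greedy (.+?) with re.findall appears to give one captured string per match instead. The clean helper and its set of stripped characters are my own assumption about what counts as trailing junk, not something taken from the page format.

```python
import re

# Sample line modeled on the first line of the infile (two matches on one line).
line = ('Subscribe to Fred\'s Auto Repair."}},"unsubscribeAccessibility":'
        'Subscribe to Fred\'s Auto Repair."}},"unsubscribeAccessibility":{"accessibilityData"')

# Greedy (.+) extends to the LAST closing keyword on the line, and
# .group() returns the entire match, so both keywords are included.
greedy = re.search(r"Subscribe to(.+)unsubscribeAccessibility", line).group()

# Non-greedy (.+?) stops at the FIRST closing keyword after each opening one.
# With a single capture group, findall returns only the captured text,
# one list entry per match.
pattern = re.compile(r"Subscribe to (.+?)unsubscribeAccessibility")
names = pattern.findall(line)


def clean(raw):
    """Hypothetical helper: strip trailing JSON punctuation (assumed junk)."""
    return raw.rstrip('."}, ')


# One subscription per line.
for name in names:
    print(clean(name))
```

Note that a line-by-line loop cannot find a match whose two keywords sit on different lines, as in the second and third lines of the infile. Reading the whole file with in_file.read() and compiling the pattern with re.DOTALL might be one way around that, but I have not tried it here.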