I am trying to get a list of strings that always appear between two keywords. The keywords are "Subscribe to" and "unsubscribeAccessibility". The infile comes from viewing the page source and copying and pasting it into a text file.

My program is partially working but it is returning too much information. It is also not writing each string to a new line.

Here is my code so far:

#reads from text file copy of HTML source code and finds word/s in between two keywords. The word/s are all YouTube subscription channel names.
#Writes subscriptions found to an outfile called Subs_List.txt
import re   #import regular expression

with open("Subs_HTML_Code.txt") as in_file, open("Subs_List.txt", 'w') as out_file: #read from infile and write to outfile

    for line in in_file: #read each line in the infile, Subs_HTML_Code.txt

        try: #try this for each line read from infile
            subscriptionFound = re.search("Subscribe to(.+)unsubscribeAccessibility", line).group() #search for all text between two keywords
            out_file.write(subscriptionFound+"\n")    #if a match is made write it to outfile on new line
            print(subscriptionFound)            #print to console each subscription found

        except: #no matches found
            out_file.write("No subscription found!\n")
            print("No subscription found!")

Here is what the infile, Subs_HTML_Code.txt looks like:

Subscribe to Fred's Auto Repair."}},"unsubscribeAccessibility":Subscribe to Fred's Auto Repair."}},"unsubscribeAccessibility":{"accessibilityData":{"label":"Unsubscribe from 
unbscribeText":{"simpleText":"31.4K"},"subscribedButtonText":{"runs":[{"text":"Subscribed"}]},"unsubscribedButtonText":{"runs":Subscribe to Fred's Auto Repair 
."}},"unsubscribeAccessibility":{"accessibilityData":{"label":"Unsubscribe from [{"text":"Unsubscribe"}]},"longSubscriberCountText":{"runs":[{"text":"31.4K

Unsubscribe from [{"text":"Unsubscribe"}]},"longSubscriberCountText":{"runs":[{"text":"31.4KSubscribe to Fred's Auto Repair."}},"unsubscribeAccessibility":
{"accessibilityData":{"label":"Unsubscribe from [{"text":"Unsubscribe"}]},"longSubscriberCountText":{"runs":[{"text":"31.4K

And here is what my program gets me. Written to the outfile, Subs_List.txt looks like:

Subscribe to Fred's Auto Repair."}},"unsubscribeAccessibility":Subscribe to Fred's Auto Repair."}},"unsubscribeAccessibility":{"accessibilityData":{"label":"Unsubscribe from unbscribeText":{"simpleText":"31.4K"},"subscribedButtonText":{"runs":[{"text":"Subscribed"}]},"unsubscribedButtonText":{"runs":Subscribe to Fred's Auto Repair ."}},"unsubscribeAccessibility
No subscription found!
Subscribe to Fred's Auto Repair."}},"unsubscribeAccessibility

I don't know why I'm getting a lot of text after Fred's Auto Repair. Getting Fred's Auto Repair."}}," would be okay, but I am getting one of the keywords in there too.

Also, if more than one instance of the string appears on the same line, my program writes them all to the same line. Why?

  • Looks like your data was originally JSON – have you considered using a JSON parser in the first place? (That is, use `json.loads()` and treat the data structures like they are, and not as blobs of text.) – user1686 Sep 09 '20 at 05:31
  • The reason for your result is that your re is greedy. You match everything between the first Subscribe and the last unsubscribe, not the next unsubscribe. – Wups Sep 09 '20 at 06:17
  • Also, the input text is not really arranged in lines. Better search the whole text at once. – Wups Sep 09 '20 at 06:23
  • @user1686, thank you for the idea and the reply. I did not paste in exactly what the infile looks like. But it starts out with the – Underdrummer Sep 11 '20 at 03:33

1 Answer


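In your case, replace the greedy .+ with the non-greedy .*? like this: re.search(r"Subscribe to(.*?)unsub", line). The captured group, .group(1), is then  Fred's Auto Repair."}}," (with a leading space). Note that .group() with no argument returns the whole match, keywords included, which is why one of your keywords ends up in the output.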
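To see the difference, here is a minimal sketch that runs both patterns against a shortened sample modelled on the line from the question. The sample string is an assumption made up for illustration, not your real infile.

```python
import re

# Shortened, made-up sample: two entries on the same line.
line = (
    'Subscribe to Fred\'s Auto Repair."}},"unsubscribeAccessibility":'
    'Subscribe to Fred\'s Auto Repair."}},"unsubscribeAccessibility":{'
)

# Greedy: .+ runs on to the LAST "unsubscribeAccessibility" in the line.
greedy = re.search(r"Subscribe to(.+)unsubscribeAccessibility", line)

# Non-greedy: .*? stops at the FIRST "unsubscribeAccessibility".
lazy = re.search(r"Subscribe to(.*?)unsubscribeAccessibility", line)

# group() would return the whole match, both keywords included;
# group(1) returns only the text captured between them.
print(repr(greedy.group(1)))
print(repr(lazy.group(1)))
```

The greedy capture swallows the second entry as well, while the non-greedy capture ends right before the first closing keyword.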

Also see: Python non-greedy regexes

If you know that certain characters won't appear in the string you want to find, but will appear after the string, you could use a negated character class. For example, [^}] matches any single character except }.

re.search will only find the first match. To find every match use re.findall instead:

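import re

with open("Subs_HTML_Code.txt") as in_file:
    whole_string = in_file.read()  # read the whole file into one string

# findall returns a list holding the captured group of every match
result = re.findall(r"Subscribe to ([^}]+)\.\"", whole_string)
for match in result:
    print(match)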

Also, don't scan your string line by line, but all at once. This will find matches that begin in one line and end in the next line, too.
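Putting the pieces together, a sketch of the whole task could look like the following. The function names find_subscriptions and write_subscriptions are hypothetical, and the sample string is made up. It also assumes that channel names never contain a } character and that each label ends with ." as in the posted sample.

```python
import re

# Negated character class, as above: capture everything up to the '."'
# that closes the label, never crossing a '}'.
PATTERN = re.compile(r'Subscribe to ([^}]+)\."')


def find_subscriptions(text):
    """Return every channel name found in text (hypothetical helper)."""
    # findall returns only the captured group for each match.
    # split/join collapses a line break inside a name into one space.
    return [" ".join(name.split()) for name in PATTERN.findall(text)]


def write_subscriptions(in_path, out_path):
    """Read in_path in one go and write one name per line to out_path."""
    with open(in_path) as in_file:
        names = find_subscriptions(in_file.read())
    with open(out_path, "w") as out_file:
        for name in names:
            out_file.write(name + "\n")
    return names


# Made-up sample: two entries, the second broken across a line boundary.
sample = (
    'Subscribe to Fred\'s Auto Repair."}},"unsubscribeAccessibility":'
    'Subscribe to Fred\'s Auto Repair \n."}},"unsubscribeAccessibility":{'
)
print(find_subscriptions(sample))
```

With the file names from the question, the call would be write_subscriptions("Subs_HTML_Code.txt", "Subs_List.txt"). Because the file is read as one string, a match that starts on one line and ends on the next is still found, and each match is written to its own line.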

Wups
  • Thank you for your answer. Both of these solutions work OK with my test infile but not my actual infile. I did not post the actual infile because it is too large. I thought my test infile was a decent example for the purposes of posing my question and testing my program but I see now I was wrong. Not sure how to post large examples. – Underdrummer Sep 11 '20 at 03:44
  • Both solutions only return one Fred's Auto Repair per line. The infile has one instance of Fred's Auto Repair appearing twice in one line. The negative character class search returns cleaner results of course. So I will go with that one. I don't think my true infile has any instances of my target strings appearing twice in one line but using the program for other infiles may cause problems. – Underdrummer Sep 11 '20 at 03:49
  • @Underdrummer To find **all** matches use `re.findall` instead. I have updated my answer. – Wups Sep 11 '20 at 06:53