Methodology
I think this would be a good use for regular expressions (regex). A regex is a pattern-matching notation that lets you describe the general structure of a string. Regex is generally a poor fit for HTML and XML, because it is not designed to parse markup and extract whatever information you may be looking for, but it should work if all you want to do is discard a section with a known shape.
Explanation
If I understand correctly, you would like to be left with
<br>Hi, please keep this text.<br>
<br>Also not Zoom default text.<br>
Which means we need to come up with a pattern to match the following portion (the brackets indicate the information that will change every time):
<br>[Name] is inviting you to a scheduled Zoom meeting.<br>
<br>Topic: [Name]\'s Personal Meeting Room<br>
<br>Join Zoom Meeting<br>
<a href="[Link]">[Link]</a><br>
<br>Meeting ID: [ID]<br>
Password: [Password]<br>
Important Pieces:
The beginning: [Name] will be some string of at least one character. To make sure you don't match <br>Hi, please keep this text.<br>, this part must not be able to run across a "<br>" tag. A character class can only exclude single characters, not a whole sequence, so exclude the "<" that starts every tag (represented in regex with [^<]). The rest of the sentence should be matched word for word, so we're not just matching anything.
The end: [Password], like [Name], is just [^<] for the same reason.
This string starts and ends with "<br>". This should be reflected in the regex.
Everything between that first sentence and the password portion has a format, but we can treat it as a wildcard: some mix of at least one character or line break (represented in regex with (.|\n)+, since . on its own does not match a line break).
Replacing all of the appropriate portions in the text, you get the following (the ? after each + makes the match lazy, so it stops at the first place the rest of the pattern fits, and (?:...) is a group that doesn't capture):
<br>[^<]+? is inviting you to a scheduled Zoom meeting\.(?:.|\n)+?Password: [^<]+?<br>
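If you want to check how a negated character class treats whatever you put inside the brackets, here is a minimal sketch. The sample names are made up for illustration, and the two patterns are just the name portion on its own, not the full Zoom pattern. It compares a class that lists the characters of "(?:<br>)" with a class that excludes only "<":

```python
import re

# Hypothetical sample names, used only for illustration.
names = ["Danlo9", "Marcel M", "Bob"]

# Inside [...] every character is taken individually, so this class
# excludes "(", "?", ":", "<", "b", "r", ">" and ")" one by one.
per_char = re.compile(r"[^(?:<br>)]+")

# Excluding only "<" stops at the start of the next tag instead.
up_to_tag = re.compile(r"[^<]+")

for name in names:
    text = name + "<br>"
    print(repr(name),
          repr(per_char.match(text).group()),
          repr(up_to_tag.match(text).group()))
```

With these inputs, the first class cuts "Marcel M" down to "Ma" because it stops at the letter "r", while the second returns the whole name.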
Code
As for the Python, the re module will come in handy here:
We want to save the above pattern into a variable, and use it to cut the matching portion out of the string.
To "save" the pattern, the re module allows you to compile the regex into a pattern object (the r before the string makes it a raw string, so Python passes the backslashes through to the regex engine untouched):
import re
zoom_pattern = re.compile(r"<br>[^<]+? is inviting you to a scheduled Zoom meeting\.(?:.|\n)+?Password: [^<]+?<br>")
The module also provides the ability to replace regex matches within strings, and we can replace our match with nothing to cut it out of the string:
import re
s = " - string with zoom meeting stuff - "
zoom_pattern = re.compile(r"<br>[^<]+? is inviting you to a scheduled Zoom meeting\.(?:.|\n)+?Password: [^<]+?<br>")
clean_string = zoom_pattern.sub("", s)
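To see the whole thing run end to end, here is a sketch with a made-up input. The sample string, the name, the link and the password are all assumptions on my part, since I don't have your real data, and the pattern is the variant that excludes only "<" in the name and password parts and spells out the line-break wildcard:

```python
import re

# Variant of the pattern: [^<] keeps the name and password from crossing a
# tag, and (?:.|\n)+? lets the middle section span line breaks.
zoom_pattern = re.compile(
    r"<br>[^<]+? is inviting you to a scheduled Zoom meeting\."
    r"(?:.|\n)+?Password: [^<]+?<br>"
)

# Hypothetical sample input, built from the template in the question.
s = (
    "<br>Hi, please keep this text.<br>\n"
    "<br>Marcel is inviting you to a scheduled Zoom meeting.<br>\n"
    "<br>Topic: Marcel's Personal Meeting Room<br>\n"
    "<br>Join Zoom Meeting<br>\n"
    '<a href="https://example.com/j/123">https://example.com/j/123</a><br>\n'
    "<br>Meeting ID: 123 456 7890<br>\n"
    "Password: abc123<br>\n"
    "<br>Also not Zoom default text.<br>"
)

# Replace the matched invitation block with nothing.
clean_string = zoom_pattern.sub("", s)
print(clean_string)
```

Only the two lines you wanted to keep should be left, with the line breaks that surrounded the invitation still in place between them.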
Since we compiled the pattern, you now have a reusable way to clean up your string!
If you'd like to change your regex to match each individual thing, just adjust the "Important Pieces" from earlier to match your goal. If you want to test your ideas, this is a wonderful resource!
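As a rough idea of what matching each individual thing could look like, here is a sketch that gives every line of the invitation its own pattern and removes them one at a time. The per-line patterns, the helper name strip_zoom_lines and the sample string are all my own assumptions, so adjust them to your real text:

```python
import re

# Hypothetical per-line patterns, one for each piece of the invitation.
# The trailing \n? also removes the line break after each piece, if any.
line_patterns = [
    r"<br>[^<]+? is inviting you to a scheduled Zoom meeting\.<br>\n?",
    r"<br>Topic: [^<]+?<br>\n?",
    r"<br>Join Zoom Meeting<br>\n?",
    r'<a href="[^"]+">[^<]+</a><br>\n?',
    r"<br>Meeting ID: [\d ]+<br>\n?",
    r"Password: [^<]+?<br>\n?",
]
zoom_lines = re.compile("|".join(line_patterns))


def strip_zoom_lines(text):
    """Remove every line that matches one of the per-line patterns."""
    return zoom_lines.sub("", text)


# Hypothetical sample input, built from the template in the question.
sample = (
    "<br>Hi, please keep this text.<br>\n"
    "<br>Marcel is inviting you to a scheduled Zoom meeting.<br>\n"
    "<br>Topic: Marcel's Personal Meeting Room<br>\n"
    "<br>Join Zoom Meeting<br>\n"
    '<a href="https://example.com/j/123">https://example.com/j/123</a><br>\n'
    "<br>Meeting ID: 123 456 7890<br>\n"
    "Password: abc123<br>\n"
    "<br>Also not Zoom default text.<br>"
)

print(strip_zoom_lines(sample))
```

The trade-off is that each piece is matched more strictly, so a line Zoom adds or rewords later will simply be left in place instead of being swallowed by the wildcard.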
`Hi, please keep this text.` at the beginning because it also matches the `.+` part of the regex? – Danlo9 Jun 28 '20 at 17:13
`.+` by itself would indeed match that as well. The difference is, to match that whole string, it would also need to match the line break between the lines. The `.+` only selects non-line break characters, so that portion would not match. If you'd like, you could make the name portion only match "word characters" (letters, numbers, and underscores), but that would mean you would have to account for hyphenated names, first name-last name, etc. – Marcel M Jun 28 '20 at 17:21
Hi, please keep this text.
[name] is inviting you to a scheduled Zoom meeting.
Topic: [name]\'s Personal Meeting Room
Join Zoom Meeting
[link]
Meeting ID: 448 751 8794
Password: [Password]
Also not Zoom default text. – Danlo9 Jun 28 '20 at 21:09