[Edited] Question: How does the code in Option 2 (bottom of page) match an input string that contains whitespace chars without explicitly defining whitespace chars in the regex? I assume it must be handling them somehow, or else it would not find a match and produce the correct output; I just don't know how.
Program Structure: Given an input string of HTML text (per examples A & B below), extract the YouTube URL from the embedded HTML text, and then print the URL in the specified format.
These are the 2 HTML input strings used to test the function parse(s):
Ex. A:
<iframe src="https://www.youtube.com/embed/xvFZjo5PgG0"></iframe>
Ex. B:
<iframe width="560" height="315" src="https://www.youtube.com/embed/xvFZjo5PgG0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
The URLs found within these HTML strings (above) can be in any of the 3 formats below, so the regex should be able to match any of these prefixes: "http://", "https://" or "https://www."
http://youtube.com/embed/xvFZjo5PgG0
https://youtube.com/embed/xvFZjo5PgG0
https://www.youtube.com/embed/xvFZjo5PgG0
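As a quick check of this requirement, here is a minimal sketch that tests only the URL portion of the pattern used in the options further down against the three formats. The names URL_PATTERN and extract_id are illustrative and are not part of my program.

```python
import re

# URL portion of the regex used in Options 1 and 2, without the surrounding HTML parts.
# URL_PATTERN and extract_id are illustrative names, not part of the original program.
URL_PATTERN = re.compile(r"(?:https?://)(?:www\.)?youtube\.com/embed/(\w*)")


def extract_id(url):
    """Return the video ID if the whole URL matches the pattern, else None."""
    match = URL_PATTERN.fullmatch(url)
    return match.group(1) if match else None


for url in (
    "http://youtube.com/embed/xvFZjo5PgG0",
    "https://youtube.com/embed/xvFZjo5PgG0",
    "https://www.youtube.com/embed/xvFZjo5PgG0",
):
    print(url, "->", extract_id(url))
```

Because the "s" and the "www." are independently optional, this pattern also accepts "http://www.youtube.com/embed/...", which is a fourth form beyond the three listed.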
Both input strings (Ex.A & Ex.B) should produce the following output when passed to parse(s):
https://youtu.be/xvFZjo5PgG0
Option 1: The code below correctly returns the expected output when either input string is passed to parse(s). To handle whitespace in the HTML input string, this solution cleans the input directly with str.replace(" ", ""), which removes every space character, such as the space in "<iframe src". Therefore, I do not define whitespace chars in the regex, because they have already been removed from the input.
import re

def main():
    print(parse(input("HTML: ").replace(" ","")))

def parse(s):
    if matches := re.search(r"^(?:<iframe[=\w\"]*src=)?\"(?:https?://)(?:www\.)?youtube\.com/embed/(\w*)\"(?:[\w=\";-]*></iframe>)?$", s):
        id = matches.group(1)
        url = f"https://youtu.be/{id}"
        return url

if __name__ == "__main__":
    main()
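To make the cleaning step in Option 1 concrete, here is a small sketch of what the input looks like after cleaning. The helper names clean and strip_all_whitespace are hypothetical; the second one is only a possible alternative and is not what Option 1 does. Note that replace(" ", "") targets the space character only, so tabs and newlines would be left in place.

```python
def clean(s):
    # Same cleaning step as Option 1's main(): removes space characters only.
    return s.replace(" ", "")


def strip_all_whitespace(s):
    # Hypothetical alternative: str.split() with no arguments splits on any
    # run of whitespace, so joining the pieces drops tabs and newlines too.
    return "".join(s.split())


example_a = '<iframe src="https://www.youtube.com/embed/xvFZjo5PgG0"></iframe>'
print(clean(example_a))

# A tab and a newline survive clean() but not strip_all_whitespace().
sample = "<iframe\tsrc=\n"
print(repr(clean(sample)))
print(repr(strip_all_whitespace(sample)))
```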
Option 2: This solution also produces the correct output when either input string (Ex. A or Ex. B above) is passed to parse(s). However, this solution does not handle whitespace chars explicitly, either by cleaning the input string (as in Option 1) or by defining whitespace chars in the regex. Yet it must be handling them somehow, as it still correctly matches the string, which contains whitespace chars.
import re

def main():
    print(parse(input("HTML: ")))

def parse(s):
    if matches := re.search(r"(?:<iframe[=\w\"]*src=)?\"(?:https?://)(?:www\.)?youtube\.com/embed/(\w*)\"([\w=\";-]*></iframe>)?", s):
        id = matches.group(1)
        url = f"https://youtu.be/{id}"
        return url

if __name__ == "__main__":
    main()
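One way to narrow this down might be to look at which part of the input Option 2's pattern actually matched, rather than only at the captured ID. The sketch below is a diagnostic and not part of either option; describe_match is a hypothetical helper, and the pattern is copied unchanged from Option 2.

```python
import re

# Same pattern as Option 2, copied unchanged.
PATTERN = r"(?:<iframe[=\w\"]*src=)?\"(?:https?://)(?:www\.)?youtube\.com/embed/(\w*)\"([\w=\";-]*></iframe>)?"


def describe_match(s):
    """Hypothetical helper: return (matched_text, (start, end), groups), or None."""
    m = re.search(PATTERN, s)
    if m is None:
        return None
    # group(0) is the exact substring the whole pattern matched; span() gives
    # its position in s; groups() shows None for optional groups that did not take part.
    return m.group(0), m.span(), m.groups()


example_a = '<iframe src="https://www.youtube.com/embed/xvFZjo5PgG0"></iframe>'
example_b = (
    '<iframe width="560" height="315" '
    'src="https://www.youtube.com/embed/xvFZjo5PgG0" '
    'title="YouTube video player" frameborder="0" allowfullscreen></iframe>'
)

for label, text in (("Ex. A", example_a), ("Ex. B (shortened)", example_b)):
    print(label, describe_match(text))
```

Since re.search is not anchored here (Option 2 has no ^ or $), the matched text does not have to start at the beginning of the input or cover all of it, so comparing group(0) and span() with the full string shows whether any whitespace was part of the match at all.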
In summary, once more, how does Option 2 (above) find a match (when passed either string Ex. A or Ex. B) and produce the correct output, considering there is no explicit handling of whitespace chars?