[Edited] Question: How does the regex in Option 2 (near the bottom of this post) match an input string that contains whitespace characters when the regex never mentions whitespace? I assume it must be handling whitespace somehow, or it would not find a match and produce the correct output, but I can't see how.

Program Structure: Given an input string of HTML text (per examples A & B below), extract the YouTube URL from the embedded HTML and print the URL in the specified format.

These are the 2 HTML input strings used to test the function parse(s):

Ex. A:

<iframe src="https://www.youtube.com/embed/xvFZjo5PgG0"></iframe>

Ex. B:

<iframe width="560" height="315" src="https://www.youtube.com/embed/xvFZjo5PgG0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

The URLs found within these HTML strings (above) can be in any of the 3 formats below, so the regex should optionally match any of: "http://", "https://" or "https://www."

http://youtube.com/embed/xvFZjo5PgG0
https://youtube.com/embed/xvFZjo5PgG0
https://www.youtube.com/embed/xvFZjo5PgG0

Both input strings (Ex.A & Ex.B) should produce the following output when passed to parse(s):

https://youtu.be/xvFZjo5PgG0

Option 1: The solution below correctly returns the expected output when the specified input strings are passed to parse(s). To handle whitespace in the HTML input string, it uses str.replace() to clean the input directly, removing every space character, such as the one in "<iframe src". Therefore, I do not define whitespace chars in the regex, because they have already been removed from the input.

import re


def main():
    print(parse(input("HTML: ").replace(" ","")))


def parse(s):
    if matches := re.search(r"^(?:<iframe[=\w\"]*src=)?\"(?:https?://)(?:www\.)?youtube\.com/embed/(\w*)\"(?:[\w=\";-]*></iframe>)?$", s):
        id = matches.group(1)
        url = f"https://youtu.be/{id}"
        return url


if __name__ == "__main__":
    main()

Option 2: This solution also produces the correct output when passing the input string (Ex. A or Ex. B above) to parse(s). However, in this solution there is no explicit handling of whitespace chars either by cleaning the input string (as in Option 1), or explicitly defining whitespace chars in the regex. Yet, it must be doing so somehow, as it still correctly matches the string, which has whitespace chars.

import re


def main():
    print(parse(input("HTML: ")))


def parse(s):
    if matches := re.search(r"(?:<iframe[=\w\"]*src=)?\"(?:https?://)(?:www\.)?youtube\.com/embed/(\w*)\"([\w=\";-]*></iframe>)?", s):
        id = matches.group(1)
        url = f"https://youtu.be/{id}"
        return url


if __name__ == "__main__":
    main()

In summary, once more, how does Option 2 (above) find a match (when passed either string Ex. A or Ex. B) and produce the correct output, considering there is no explicit handling of whitespace chars?

DAK
  • there's lots of people here who can help, but it's really not clear what your question is. Just post an example of what's happening unexpectedly, the solution you were expecting and what you've tried. Don't worry about the backstory – bn_ln Oct 30 '22 at 21:51
  • I have no account on CS50 and am not planning to create one for this question. You should include in your question the necessary information to *reproduce* the behaviour that you are describing. Don't expect us to log into that third party website or to guess on how it works. – trincot Oct 30 '22 at 21:54
  • In the second solution, the iframe blocks are optional (`?`), so if they do not match, the regex can still match the http: ... youtube address (and thus no whitespace needs to be matched around the youtube address). Is that what you haven't spotted? – Pac0 Oct 30 '22 at 21:58
  • Wasn't remotely expecting anyone to "create accounts" or anything like that, but obviously my question wasn't clear. Re-edited the entire question, hopefully making it clearer? – DAK Oct 31 '22 at 00:57
  • When I run your code, option 2 returns "None" for both input strings. You sure you saved your changes? – Tim Roberts Oct 31 '22 at 01:08
  • @TimRoberts saving changes isn't enough, you must also `reload` the file. See https://stackoverflow.com/a/32234323/5987 – Mark Ransom Oct 31 '22 at 01:21
  • @TimRoberts Hm not sure what the issue is, I checked the code I posted against the actual and it looks to be the same. I also copy/pasted the posted code and it still runs correctly for me for both input strings? – DAK Oct 31 '22 at 03:07
  • I agree with Pac0 - I think you missed the optional (`?`) item which is ignoring everything before `src=`. It might be clearer if you see it here: https://regex101.com/r/1vKBG5/1 – ScottC Oct 31 '22 at 13:55
  • @Pac0 `@Scott` Pac0 is correct, that's exactly the issue. I had missed the comment initially. Once the optional (`?`) symbol is removed, the test strings return "None" until a space is explicitly included inside the brackets - `[=\w\" ]` – DAK Nov 01 '22 at 03:34

1 Answer

I think you have a slight misunderstanding of exactly how both are working, but let's start with Option 2, since explaining how it works sheds some light on how Option 1 works.

Why does Option 2 work?

The following regex (B):

(?:<iframe[=\w\"]*src=)?\"(?:https?://)(?:www\.)?youtube\.com/embed/(\w*)\"([\w=\";-]*></iframe>)?

does not actually handle whitespace. If you try it out in an online regex tool, you can see that it matches like so (item 1 is the full match, the remaining items are the capture groups):

Ex A <iframe src="https://www.youtube.com/embed/xvFZjo5PgG0"></iframe>:

  1. "https://www.youtube.com/embed/xvFZjo5PgG0"></iframe>
  2. xvFZjo5PgG0
  3. ></iframe>

Ex B <iframe width="560" height="315" src="https://www.youtube.com/embed/xvFZjo5PgG0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>:

  1. "https://www.youtube.com/embed/xvFZjo5PgG0"
  2. xvFZjo5PgG0

The other content in the string is completely ignored, but since you are using Python's re.search, which scans the string for the first position where the pattern matches, you still get a match. If you were to use re.match, which only tries to match at the beginning of the string, it would fail.
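
As a rough illustration of the difference, the sketch below runs Option 2's pattern against Ex. A with both functions. The names PATTERN, EX_A, found and anchored are only for this example and do not appear in the question.

```python
import re

# Option 2's pattern, copied from the question.
PATTERN = r"(?:<iframe[=\w\"]*src=)?\"(?:https?://)(?:www\.)?youtube\.com/embed/(\w*)\"([\w=\";-]*></iframe>)?"

EX_A = '<iframe src="https://www.youtube.com/embed/xvFZjo5PgG0"></iframe>'

# re.search tries every starting position until the pattern matches.
found = re.search(PATTERN, EX_A)
# re.match only tries position 0, where the character is "<".
anchored = re.match(PATTERN, EX_A)

print(found.start())   # index of the opening quote of the URL
print(found.group(1))  # the video id
print(anchored)        # None, because the optional iframe group cannot cross the space
```

The match found by re.search starts at the opening quote, so the `<iframe src=` prefix, including its space, is never part of the match.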

You can test this by changing the input string to simply "https://www.youtube.com/embed/xvFZjo5PgG0" and it still works the exact same way (getting all the exact same matches).

In fact, in this scenario, most of the regex is superfluous; the heavy lifting is done by \"(?:https?://)(?:www\.)?youtube\.com/embed/(\w*)\". The rest does nothing for a string that has not had white-space stripped, and does next to nothing when you are using re.search.
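
One way to check this claim is to run only that core sub-pattern against the original test strings and against the bare quoted URL. This is a minimal sketch; CORE, EX_A, EX_B, BARE and ids are names made up for the example.

```python
import re

# Only the mandatory part of Option 2's pattern.
CORE = r"\"(?:https?://)(?:www\.)?youtube\.com/embed/(\w*)\""

EX_A = '<iframe src="https://www.youtube.com/embed/xvFZjo5PgG0"></iframe>'
EX_B = (
    '<iframe width="560" height="315" '
    'src="https://www.youtube.com/embed/xvFZjo5PgG0" '
    'title="YouTube video player" frameborder="0" '
    'allow="accelerometer; autoplay; clipboard-write; encrypted-media; '
    'gyroscope; picture-in-picture" allowfullscreen></iframe>'
)
BARE = '"https://www.youtube.com/embed/xvFZjo5PgG0"'

# Capture group 1 (the video id) for each input.
ids = [re.search(CORE, s).group(1) for s in (EX_A, EX_B, BARE)]
print(ids)
```

All three inputs yield the same id, which suggests the optional iframe groups contribute nothing to the result here.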

You can see this by throwing white-space stripped strings at this regex and seeing how the matches change:

Ex A <iframesrc="https://www.youtube.com/embed/xvFZjo5PgG0"></iframe>:

  1. <iframesrc="https://www.youtube.com/embed/xvFZjo5PgG0"></iframe>
  2. xvFZjo5PgG0

Ex B <iframewidth="560"height="315"src="https://www.youtube.com/embed/xvFZjo5PgG0"title="YouTubevideoplayer"frameborder="0"allow="accelerometer;autoplay;clipboard-write;encrypted-media;gyroscope;picture-in-picture"allowfullscreen></iframe>:

  1. <iframewidth="560"height="315"src="https://www.youtube.com/embed/xvFZjo5PgG0"title="YouTubevideoplayer"frameborder="0"allow="accelerometer;autoplay;clipboard-write;encrypted-media;gyroscope;picture-in-picture"allowfullscreen></iframe>
  2. xvFZjo5PgG0
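
The same comparison can be reproduced in Python rather than in an online tool. The sketch below is only illustrative; PATTERN is Option 2's regex from the question, while EX_A, raw and squeezed are names chosen for the example.

```python
import re

# Option 2's pattern, copied from the question.
PATTERN = r"(?:<iframe[=\w\"]*src=)?\"(?:https?://)(?:www\.)?youtube\.com/embed/(\w*)\"([\w=\";-]*></iframe>)?"

EX_A = '<iframe src="https://www.youtube.com/embed/xvFZjo5PgG0"></iframe>'

raw = re.search(PATTERN, EX_A)                        # input with its space intact
squeezed = re.search(PATTERN, EX_A.replace(" ", ""))  # input with spaces removed

print(raw.group(0))       # begins at the opening quote of the URL
print(squeezed.group(0))  # covers the whole stripped string
print(raw.group(1) == squeezed.group(1))  # the captured id is identical
```

Only the extent of the full match changes; the captured id, which is all that parse(s) uses, is the same either way.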

So why does Option 1 work?

The reason the first option works is that you are squishing everything up and removing the white-space, which finally makes the rest of the regex do something. In fact, the only real differences between the regexes of the two options are that Option 1 makes the trailing group non-capturing and adds ^ and $, which force the entire string to be matched (so re.search behaves like re.fullmatch). The anchors change nothing for a white-space stripped string, because the pattern already spans all of it.

Ex A <iframesrc="https://www.youtube.com/embed/xvFZjo5PgG0"></iframe>:

  1. <iframesrc="https://www.youtube.com/embed/xvFZjo5PgG0"></iframe>
  2. xvFZjo5PgG0

Ex B <iframewidth="560"height="315"src="https://www.youtube.com/embed/xvFZjo5PgG0"title="YouTubevideoplayer"frameborder="0"allow="accelerometer;autoplay;clipboard-write;encrypted-media;gyroscope;picture-in-picture"allowfullscreen></iframe>:

  1. <iframewidth="560"height="315"src="https://www.youtube.com/embed/xvFZjo5PgG0"title="YouTubevideoplayer"frameborder="0"allow="accelerometer;autoplay;clipboard-write;encrypted-media;gyroscope;picture-in-picture"allowfullscreen></iframe>
  2. xvFZjo5PgG0
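
To see how much Option 1 depends on the stripping step, the anchored pattern can be tried on Ex. B both with and without its spaces. This is a sketch for illustration only; ANCHORED is Option 1's regex from the question, and EX_B, unstripped and stripped are names chosen for the example.

```python
import re

# Option 1's pattern, copied from the question (note the ^ and $ anchors).
ANCHORED = r"^(?:<iframe[=\w\"]*src=)?\"(?:https?://)(?:www\.)?youtube\.com/embed/(\w*)\"(?:[\w=\";-]*></iframe>)?$"

EX_B = (
    '<iframe width="560" height="315" '
    'src="https://www.youtube.com/embed/xvFZjo5PgG0" '
    'title="YouTube video player" frameborder="0" '
    'allow="accelerometer; autoplay; clipboard-write; encrypted-media; '
    'gyroscope; picture-in-picture" allowfullscreen></iframe>'
)

# With spaces present, the anchored pattern cannot span the whole string.
unstripped = re.search(ANCHORED, EX_B)
# After removing spaces, the character classes can cover every attribute.
stripped = re.search(ANCHORED, EX_B.replace(" ", ""))

print(unstripped)         # None
print(stripped.group(1))  # the video id
```

With the anchors in place, the optional groups are no longer decorative: they have to consume everything around the URL, which they can only do once the spaces are gone.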

TL;DR:

It works because re.search will match in the middle of the string, and the only non-optional part of the regex (\"(?:https?://)(?:www\.)?youtube\.com/embed/(\w*)\") will never have spaces.
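
If only the video id is needed, one possible simplification is to keep just that non-optional part. The sketch below is not the original solution: parse_core and CORE are hypothetical names, and using \w+ instead of \w* (so that an empty id is rejected) is an assumption of this example.

```python
import re

# Hypothetical simplification: only the mandatory part of the original pattern.
# Assumption: \w+ replaces \w* so that an empty id does not count as a match.
CORE = re.compile(r"\"https?://(?:www\.)?youtube\.com/embed/(\w+)\"")


def parse_core(s):
    """Return the short youtu.be URL for the first embed link in s, else None."""
    if match := CORE.search(s):
        return f"https://youtu.be/{match.group(1)}"
    return None


# The three URL formats listed in the question, each wrapped in an iframe tag.
for url in (
    "http://youtube.com/embed/xvFZjo5PgG0",
    "https://youtube.com/embed/xvFZjo5PgG0",
    "https://www.youtube.com/embed/xvFZjo5PgG0",
):
    print(parse_core(f'<iframe src="{url}"></iframe>'))
```

Because nothing outside the quoted URL is matched, whitespace elsewhere in the tag never has to be described by the pattern.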

Digital Deception
  • Your point is extremely well taken & it might be helpful to be even more explicit - the logic of the question is flawed. While the regex worked w/o defining whitespace, it was not b/c whitespace was "magically" handled. As you said, it was b/c `re.search` searched for & found, the only consequential/non-optional part `\"(?:https?://)(?:www\.)?youtube\.com/embed/(\w*)\"`. Similarly, although removing the `?` blocks & explicitly defining whitespace in `[=\w\" ]` worked, it wasn't b/c it matched all whitespace. It just made ` – DAK Nov 03 '22 at 21:25