Regex to find URL parameters in HTML (Ruby)

Question

I am attempting to replace embedded YouTube videos with thumbnails in dynamically created email templates. I am attempting to find each YouTube ID from each embedded URL, then replace the entire block with custom HTML. I have it working if there is only one embedded video with the following RegEx:

<span contenteditable="false" draggable="true" fr-original-class="fr-video\sfr-dvb\sfr-draggable"\s.*\ssrc="[a-z:]*?\/\/w{3}?.?youtube.com\/embed\/([a-zA-Z\d\-]*).*<\/iframe><\/span>

The problem is, if there is more than one video, it will only find the ID from the last video. I feel like I may be over-complicating this.

Note that the attributes of the span that the embedded video is in will always be the same (contenteditable="false" draggable="true" fr-original-class="fr-video).

A sample email template is below. The above RegEx only pulls the second ID from it, not the first. I would like to pull both.

This is being done in Ruby.

EDIT: I realize the RegEx I am using is probably overkill, but I need a complex RegEx for the gsub replace so that I only replace the video and its container, not anything surrounding it.

<!DOCTYPE html>
<html>
  <head>
    <meta content='text/html; charset=UTF-8' http-equiv='Content-Type'>
  </head>
  <body style='margin: 0px; font-family: Helvetica Neue,Helvetica,Arial,sans-serif; font-size: 18px;'>
    <table border='0' cellpadding='0' cellspacing='0' style='font-family: Helvetica Neue,Helvetica,Arial,sans-serif; width: 600px;' width='600'>
      <tr>
        <td>
          FooBar
          <br>
          <br>
          <span contenteditable="false" draggable="true" fr-original-class="fr-video fr-dvb fr-draggable" fr-original-style="-webkit-user-select: none;" style="-webkit-user-select: none; text-align: center; position: relative; display: block; clear: both;">
            <iframe src="//cdn.embedly.com/widgets/media.html?src=https://www.youtube.com/embed/e7zCqsjK1Vg?feature=oembed&amp;url=http://www.youtube.com/watch?v=e7zCqsjK1Vg&amp;image=https://i.ytimg.com/vi/e7zCqsjK1Vg/hqdefault.jpg&amp;key=2aa3c4d5f3de4f5b9120b660ad850dc9&amp;type=text/html&amp;schema=youtube" width="600" height="338" scrolling="no" frameborder="0" allowfullscreen="" style="box-sizing: content-box; max-width: 100%; border: 0px;" fr-original-style="box-sizing: content-box; max-width: 100%; border: 0px;" fr-original-class="embedly-embed"></iframe>
          </span>
          <br>
          Foo Bar
          <br>
          <br>
          <span contenteditable="false" draggable="true" fr-original-class="fr-video fr-dvb fr-draggable" fr-original-style="-webkit-user-select: none;" style="-webkit-user-select: none; text-align: center; position: relative; display: block; clear: both;">
            <iframe src="//cdn.embedly.com/widgets/media.html?src=https://www.youtube.com/embed/skLz87ixE48?feature=oembed&amp;url=http://www.youtube.com/watch?v=skLz87ixE48&amp;image=https://i.ytimg.com/vi/skLz87ixE48/hqdefault.jpg&amp;key=2aa3c4d5f3de4f5b9120b660ad850dc9&amp;type=text/html&amp;schema=youtube" width="600" height="338" scrolling="no" frameborder="0" allowfullscreen="" style="box-sizing: content-box; max-width: 100%; border: 0px;" fr-original-style="box-sizing: content-box; max-width: 100%; border: 0px;" fr-original-class="embedly-embed"></iframe>
          </span>
          <br>
        </td>
      </tr>
      <tr style='font-family: Helvetica Neue,Helvetica,Arial,sans-serif; font-size: 12px; color: #656565; text-align: center;'>
        <td style='padding: 10px 0px;'>
        </td>
      </tr>
    </table>
  </body>
</html>

So if I understand this correctly, you're trying to do 2 things with regex? One of which is to remove the `<span>...</span>`s containing YouTube embeds? And the second is to capture the IDs of those YouTube embeds? — wpcarro, Jun 29 '16 at 19:50
@wcarroll that is correct. Doing the two operations separately is fine. I would like to match the IDs of the embeds and, for each ID I find, replace the YouTube embed and its container with custom HTML I generate. My current RegEx matches from the beginning of the first embed through to the end of the second embed, which is not what I want, obviously. — tommybond, Jun 29 '16 at 19:53
It's strongly recommended you use a parser rather than regular expressions when working with HTML or XML. See http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags?rq=1 for a historical discussion. The de facto parser for Ruby is [Nokogiri](http://www.nokogiri.org). Nokogiri makes it easy to find particular nodes, extract information, and modify the DOM without using `sub` or `gsub`. — the Tin Man, Jun 29 '16 at 20:06
@theTinMan that definitely makes sense rather than using `gsub`. Thanks for this reminder. — tommybond, Jun 29 '16 at 20:14

the Tin Man · Accepted Answer · 2016-06-29T20:34:29.463

Don't use regular expressions for this. There are existing tools to make it much easier:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<!DOCTYPE html>
<html>
  <body>
    <table>
      <tr>
        <td>
          <span>
            <iframe src="//cdn.embedly.com/widgets/media.html?src=https://www.youtube.com/embed/e7zCqsjK1Vg?feature=oembed&amp;url=http://www.youtube.com/watch?v=e7zCqsjK1Vg&amp;image=https://i.ytimg.com/vi/e7zCqsjK1Vg/hqdefault.jpg&amp;key=2aa3c4d5f3de4f5b9120b660ad850dc9&amp;type=text/html&amp;schema=youtube" width="600" height="338" scrolling="no" frameborder="0" allowfullscreen="" style="box-sizing: content-box; max-width: 100%; border: 0px;" fr-original-style="box-sizing: content-box; max-width: 100%; border: 0px;" fr-original-class="embedly-embed"></iframe>
          </span>
          <span>
            <iframe src="//cdn.embedly.com/widgets/media.html?src=https://www.youtube.com/embed/skLz87ixE48?feature=oembed&amp;url=http://www.youtube.com/watch?v=skLz87ixE48&amp;image=https://i.ytimg.com/vi/skLz87ixE48/hqdefault.jpg&amp;key=2aa3c4d5f3de4f5b9120b660ad850dc9&amp;type=text/html&amp;schema=youtube" width="600" height="338" scrolling="no" frameborder="0" allowfullscreen="" style="box-sizing: content-box; max-width: 100%; border: 0px;" fr-original-style="box-sizing: content-box; max-width: 100%; border: 0px;" fr-original-class="embedly-embed"></iframe>
          </span>
        </td>
      </tr>
    </table>
  </body>
</html>
EOT

At this point it's easy to search for the <span> tags. Here's the first one:

doc.search('span').first.to_html
# => "<span>\n            <iframe src=\"//cdn.embedly.com/widgets/media.html?src=https://www.youtube.com/embed/e7zCqsjK1Vg?feature=oembed&amp;url=http://www.youtube.com/watch?v=e7zCqsjK1Vg&amp;image=https://i.ytimg.com/vi/e7zCqsjK1Vg/hqdefault.jpg&amp;key=2aa3c4d5f3de4f5b9120b660ad850dc9&amp;type=text/html&amp;schema=youtube\" width=\"600\" height=\"338\" scrolling=\"no\" frameborder=\"0\" allowfullscreen=\"\" style=\"box-sizing: content-box; max-width: 100%; border: 0px;\" fr-original-style=\"box-sizing: content-box; max-width: 100%; border: 0px;\" fr-original-class=\"embedly-embed\"></iframe>\n          </span>"

last or regular array indexing could be used to find specific instances if necessary.

Instead of chaining search and first, we can use at, which does both internally:

doc.at('span').to_html
# => "<span>\n            <iframe src=\"//cdn.embedly.com/widgets/media.html?src=https://www.youtube.com/embed/e7zCqsjK1Vg?feature=oembed&amp;url=http://www.youtube.com/watch?v=e7zCqsjK1Vg&amp;image=https://i.ytimg.com/vi/e7zCqsjK1Vg/hqdefault.jpg&amp;key=2aa3c4d5f3de4f5b9120b660ad850dc9&amp;type=text/html&amp;schema=youtube\" width=\"600\" height=\"338\" scrolling=\"no\" frameborder=\"0\" allowfullscreen=\"\" style=\"box-sizing: content-box; max-width: 100%; border: 0px;\" fr-original-style=\"box-sizing: content-box; max-width: 100%; border: 0px;\" fr-original-class=\"embedly-embed\"></iframe>\n          </span>"

We can dig into a node to grab its parameters:

doc.at('iframe')['src']
# => "//cdn.embedly.com/widgets/media.html?src=https://www.youtube.com/embed/e7zCqsjK1Vg?feature=oembed&url=http://www.youtube.com/watch?v=e7zCqsjK1Vg&image=https://i.ytimg.com/vi/e7zCqsjK1Vg/hqdefault.jpg&key=2aa3c4d5f3de4f5b9120b660ad850dc9&type=text/html&schema=youtube"

Once you have a URL, there are tools for manipulating it too:

require 'uri'
iframe = doc.at('iframe')
uri = URI.parse('http:' + iframe['src'])

We can extract the query:

uri.query # => "src=https://www.youtube.com/embed/e7zCqsjK1Vg?feature=oembed&url=http://www.youtube.com/watch?v=e7zCqsjK1Vg&image=https://i.ytimg.com/vi/e7zCqsjK1Vg/hqdefault.jpg&key=2aa3c4d5f3de4f5b9120b660ad850dc9&type=text/html&schema=youtube"

We can parse it into a hash, making it easy to pick it apart:

URI::decode_www_form(uri.query).to_h['src']
# => "https://www.youtube.com/embed/e7zCqsjK1Vg?feature=oembed"

... or modify it:

query = URI::decode_www_form(uri.query).to_h
query['src'] = 'http://example.com'

uri.query = URI::encode_www_form(query)

uri.to_s
# => "http://cdn.embedly.com/widgets/media.html?src=http%3A%2F%2Fexample.com&url=http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3De7zCqsjK1Vg&image=https%3A%2F%2Fi.ytimg.com%2Fvi%2Fe7zCqsjK1Vg%2Fhqdefault.jpg&key=2aa3c4d5f3de4f5b9120b660ad850dc9&type=text%2Fhtml&schema=youtube"
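Going back to the decoded src value, the same standard-library tools can pull the video ID out of it. This is a minimal sketch, not part of the original answer: the helper name youtube_id_from is hypothetical, and it assumes the embedded URL has the form https://www.youtube.com/embed/<ID>?... as in the sample template.

```ruby
require 'uri'

# Hypothetical helper: returns the YouTube ID carried in an Embedly iframe
# `src`, or nil when the URL has no usable `src` query parameter.
def youtube_id_from(iframe_src)
  # The sample URLs are protocol-relative, so prepend a scheme as above.
  full = iframe_src.start_with?('//') ? "http:#{iframe_src}" : iframe_src
  uri = URI.parse(full)
  return nil unless uri.query

  embedded = URI.decode_www_form(uri.query).to_h['src']
  return nil unless embedded

  # Assumption: the ID is the path segment right after /embed/.
  embedded[%r{youtube\.com/embed/([\w-]+)}, 1]
end

src = '//cdn.embedly.com/widgets/media.html?src=https://www.youtube.com/embed/e7zCqsjK1Vg?feature=oembed&url=http://www.youtube.com/watch?v=e7zCqsjK1Vg&type=text/html&schema=youtube'
youtube_id_from(src) # => "e7zCqsjK1Vg"
```

The only regular expression left is a short one applied to a single, already-isolated URL, which is much less fragile than matching against the whole document.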

Once you're there, it's easy to modify the HTML if necessary:

iframe['src'] = uri.to_s
iframe.to_html
# => "<iframe src=\"http://cdn.embedly.com/widgets/media.html?src=http%3A%2F%2Fexample.com&amp;url=http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3De7zCqsjK1Vg&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2Fe7zCqsjK1Vg%2Fhqdefault.jpg&amp;key=2aa3c4d5f3de4f5b9120b660ad850dc9&amp;type=text%2Fhtml&amp;schema=youtube\" width=\"600\" height=\"338\" scrolling=\"no\" frameborder=\"0\" allowfullscreen=\"\" style=\"box-sizing: content-box; max-width: 100%; border: 0px;\" fr-original-style=\"box-sizing: content-box; max-width: 100%; border: 0px;\" fr-original-class=\"embedly-embed\"></iframe>"

and:

doc.to_html
# => "<!DOCTYPE html>\n<html>\n  <body>\n    <table>\n      <tr>\n        <td>\n          <span>\n            <iframe src=\"http://cdn.embedly.com/widgets/media.html?src=http%3A%2F%2Fexample.com&amp;url=http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3De7zCqsjK1Vg&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2Fe7zCqsjK1Vg%2Fhqdefault.jpg&amp;key=2aa3c4d5f3de4f5b9120b660ad850dc9&amp;type=text%2Fhtml&amp;schema=youtube\" width=\"600\" height=\"338\" scrolling=\"no\" frameborder=\"0\" allowfullscreen=\"\" style=\"box-sizing: content-box; max-width: 100%; border: 0px;\" fr-original-style=\"box-sizing: content-box; max-width: 100%; border: 0px;\" fr-original-class=\"embedly-embed\"></iframe>\n          </span>\n          <span>\n            <iframe src=\"//cdn.embedly.com/widgets/media.html?src=https://www.youtube.com/embed/skLz87ixE48?feature=oembed&amp;url=http://www.youtube.com/watch?v=skLz87ixE48&amp;image=https://i.ytimg.com/vi/skLz87ixE48/hqdefault.jpg&amp;key=2aa3c4d5f3de4f5b9120b660ad850dc9&amp;type=text/html&amp;schema=youtube\" width=\"600\" height=\"338\" scrolling=\"no\" frameborder=\"0\" allowfullscreen=\"\" style=\"box-sizing: content-box; max-width: 100%; border: 0px;\" fr-original-style=\"box-sizing: content-box; max-width: 100%; border: 0px;\" fr-original-class=\"embedly-embed\"></iframe>\n          </span>\n        </td>\n      </tr>\n    </table>\n  </body>\n</html>\n"

This isn't exactly an example of how to solve the problem you're asking about. Instead, it's a reminder that there are existing, well-tested wheels based on the specs, and we should use them.

I may have to use a mashup of both methods. I only want to pull `<span>` nodes that have embedded YouTube videos within them. — tommybond, Jun 29 '16 at 20:39
No, it's possible to do without complex regex, using Nokogiri and URI. Read about CSS selectors and how to search inside parameters, or learn about XPath. Those have been discussed many times here on SO, and on the internet. — the Tin Man, Jun 29 '16 at 20:42
Okay, you were definitely correct. Just got this working really elegantly and simply using Nokogiri. Thanks a lot! — tommybond, Jun 29 '16 at 22:19
I'm glad it helped. The benefits for using a parser don't really kick in until you've written several scrapers or spiders and see how easy it is to root around in the DOM, or you're parsing XML or manipulating it. Regex break so easily, especially with tiny changes to the HTML or XML, and having to support a fragile solution is enough to make anyone scream. — the Tin Man, Jun 29 '16 at 23:26

wpcarro · Answer 2 · 2016-06-29T20:17:43.083

To grab the YouTube IDs, I think the best way would be to use look-arounds. The following should work.

(?<=embed\/)(.+?)(?=\?)

Here's a link to a demonstration on regex101.com

Turn on the "global" flag so that the regex engine doesn't stop after finding the first match. This regex uses a look-behind, (?<=embed\/); followed by a capturing group that matches wildcard characters in a non-greedy fashion, (.+?); followed by a look-ahead that asserts a literal question mark, (?=\?).

This should suffice in grabbing the video IDs.
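Since the question is about Ruby, here is roughly how that pattern could be applied there. Ruby has no "global" flag; String#scan returns every match instead. The sample HTML is a trimmed-down stand-in for the template in the question.

```ruby
html = <<~HTML
  <iframe src="//cdn.embedly.com/widgets/media.html?src=https://www.youtube.com/embed/e7zCqsjK1Vg?feature=oembed&amp;schema=youtube"></iframe>
  <iframe src="//cdn.embedly.com/widgets/media.html?src=https://www.youtube.com/embed/skLz87ixE48?feature=oembed&amp;schema=youtube"></iframe>
HTML

# With a capture group, scan yields one array per match, hence flatten.
ids = html.scan(/(?<=embed\/)(.+?)(?=\?)/).flatten
# => ["e7zCqsjK1Vg", "skLz87ixE48"]
```

Note that this relies on "embed/" appearing only in the YouTube URL and on the ID always being followed by a question mark.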

As for replacing the HTML, here's a regex that will match the <span>...</span> blocks:

<span.*?>\s*<iframe.+?>.*?<\/iframe>\s*<\/span>

For this to work, apply the s flag to the regex engine so that . wildcard characters can match \n newline characters (in Ruby, the equivalent modifier is m). Also apply the g flag for the same reason mentioned previously (in Ruby, gsub and scan already process every match).

NOTE: this will capture any <span> groups that have <iframe>s as direct children. Depending on the content with which you are working, you may need to add more specificity to the regex to scan the attributes on those <iframe>s. For the content you provided to this question, however, it appears to work.
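As a sketch of what the extra specificity might look like in Ruby, the variant below only matches spans whose iframe points at a YouTube embed and captures the ID at the same time, so gsub can build the replacement per video. The pattern is a tightened version of the one above rather than the original, and the thumbnail markup is a placeholder assumption based on the URL patterns in the sample template.

```ruby
html = <<~HTML
  FooBar
  <span contenteditable="false" draggable="true" fr-original-class="fr-video fr-dvb fr-draggable">
    <iframe src="//cdn.embedly.com/widgets/media.html?src=https://www.youtube.com/embed/e7zCqsjK1Vg?feature=oembed&amp;schema=youtube"></iframe>
  </span>
  Foo Bar
  <span contenteditable="false" draggable="true" fr-original-class="fr-video fr-dvb fr-draggable">
    <iframe src="//cdn.embedly.com/widgets/media.html?src=https://www.youtube.com/embed/skLz87ixE48?feature=oembed&amp;schema=youtube"></iframe>
  </span>
HTML

# m lets . match newlines; x permits whitespace and comments in the pattern.
video_block = %r{
  <span[^>]*>\s*                            # opening span, any attributes
  <iframe[^>]*youtube\.com/embed/([\w-]+)   # iframe src carrying the video ID
  [^>]*>.*?</iframe>\s*                     # remainder of the iframe, lazily
  </span>
}mx

result = html.gsub(video_block) do
  id = Regexp.last_match(1)
  # Placeholder markup (an assumption); substitute your own thumbnail HTML.
  %(<a href="https://www.youtube.com/watch?v=#{id}"><img src="https://i.ytimg.com/vi/#{id}/hqdefault.jpg" alt="Video thumbnail"></a>)
end
```

Using [^>]* instead of .* inside the tags keeps each match within a single tag, which is what stops the pattern from running from the first span to the last one.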

Let me know if you'd like any clarification or additional functionality.

Here's a link to a demonstration on regex101.com.

Fantastic, thank you so much for this. The first regex seems to work wonderfully for my purpose, though the second one doesn't seem to work with the example I've posted. I did alter it to `\s*.*?<\/iframe>\s*<\/span>` to account for the `` attributes, but it still does not seem to be working. — tommybond, Jun 29 '16 at 20:10
How about this? `\s*.*?<\/iframe>\s*<\/span>` I'll edit my answer if this works for you. Make sure the flags are set to `g` and `s`. This is working here. https://regex101.com/r/nF0bQ6/1 Do you have additional content for which this fails? — wpcarro, Jun 29 '16 at 20:14
Great. I'll edit my response and then can you mark it as correct? — wpcarro, Jun 29 '16 at 20:17
