I want to write a regex that ignores iframes containing URLs from youtube, vimeo or soundcloud in a string encoded with HTML entities.

This is what I tried; it is not working. Some sample texts are given below.

REGEX

<iframe(^?youtube|soundcloud|vimeo)*\/iframe

SAMPLE TEXT

<p><iframe src="http://www.3you3tube.com/embed/YoX1yc92MOU" width="500" height="300" frameborder="0" scrolling="auto"></iframe></p>
29  <p>text daily to place domain staff as volunteers with charity partners, we know all too well that the "V" word can sometimes be misunderstood. Occasionally seen as a dusty, worthy word, it can conjure images of coffee mornings and bric-a-brac stalls. So its not always as easy as you might think to get people to embrace their inner-volunteer. That's why the <a href="http://www.domain.co.uk/sdfn/2010/11/connect-create-domain-volunteers.shtml">Conne

SAMPLE OUTPUT

<iframe src="http://www.3you3tube.com/embed/YoX1yc92MOU" width="500" height="300" frameborder="0" scrolling="auto"></iframe>

SAMPLE TEXT

<p><iframe src="http://www.youtube.com/embed/YoX1yc92MOU" width="500" height="300" frameborder="0" scrolling="auto"></iframe></p>
29  <p>text daily to place domain staff as volunteers with charity partners, we know all too well that the "V" word can sometimes be misunderstood. Occasionally seen as a dusty, worthy word, it can conjure images of coffee mornings and bric-a-brac stalls. So its not always as easy as you might think to get people to embrace their inner-volunteer. That's why the <a href="http://www.domain.co.uk/sdfn/2010/11/connect-create-domain-volunteers.shtml">Conne

SAMPLE OUTPUT

nil

Just to be clear:

I want to ignore iframes that have youtube, vimeo or soundcloud in them.

I am testing it on Rubular: http://rubular.com/r/F9x6SSkIfu

Zeeshan Abbas
  • This isn't a good use of regular expressions. HTML can vary too much for a pattern to handle. Instead, decode the entities back into HTML, then use a parser, such as Nokogiri, which will normalize the HTML, making it easy to ignore differences in order, whitespace, capitalization, etc. – the Tin Man Sep 02 '14 at 22:28
  • I tried the solution you mentioned, and it seems the data is not very consistent. There are several broken tags that prevent Nokogiri from parsing the HTML string properly. One example is this question: http://stackoverflow.com/questions/25596881/how-to-parse-xml-with-nokogiri-without-losing-html-entities/25604318#25604318 – Zeeshan Abbas Sep 03 '14 at 09:16

3 Answers

<iframe.*?src="(?![^"]*(?:youtube|vimeo|soundcloud)).*?<\/iframe>

Demo


The key here is iframe.*?src="(?![^"]*(?:youtube|vimeo|soundcloud)), so let me expand that for you:

iframe                          ?# literally match iframe
.*?                             ?# lazily match 0+ characters
src="                           ?# literally match src="
(?!                             ?# start negative lookahead assertion
  [^"]*                         ?# match 0+ non-" characters
  (?:youtube|vimeo|soundcloud)  ?# match one of the domains
)                               ?# end assertion

So as soon as the expression reaches an iframe's src attribute, the negative lookahead checks whether one of the domains appears after any number of non-" characters (in other words, anywhere before the end of the src attribute). As long as none of these domains is found in the attribute, we continue on by lazily matching the rest of the iframe (until the closing tag).
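
As a rough check of the behaviour described above, here is a minimal Ruby sketch that applies the same pattern to shortened versions of the question's two samples. The constant and method names (IFRAME_PATTERN, OTHER_HOST, YOUTUBE_HOST, match_iframe) are made up for illustration, and the sample strings are trimmed by hand, so treat it as a sketch rather than a tested solution for real data.

```ruby
# Pattern copied from the answer above, unchanged.
IFRAME_PATTERN = /<iframe.*?src="(?![^"]*(?:youtube|vimeo|soundcloud)).*?<\/iframe>/

# Hand-shortened versions of the question's two sample strings.
OTHER_HOST   = '<p><iframe src="http://www.3you3tube.com/embed/YoX1yc92MOU" width="500"></iframe></p>'
YOUTUBE_HOST = '<p><iframe src="http://www.youtube.com/embed/YoX1yc92MOU" width="500"></iframe></p>'

# String#[] with a Regexp returns the matched substring, or nil if nothing matches.
def match_iframe(text)
  text[IFRAME_PATTERN]
end

puts match_iframe(OTHER_HOST).inspect    # the whole <iframe ...></iframe> element
puts match_iframe(YOUTUBE_HOST).inspect  # nil
```

One caveat worth keeping in mind: because the first .*? is free to skip ahead, a string holding a youtube iframe followed by an iframe from another host is matched from the first <iframe through the second </iframe>, so the pattern is safest when each string contains a single iframe.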

Sam

You can use this regex:

.*?iframe src=".*?(?:youtube|soundcloud|vimeo).*?".*|(.*?iframe src=".*?".*)

Working demo

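You can see that for the first input (the green one) the output is what you specified in the question. For the blue match there is no output, since it is a valid match for youtube, soundcloud or vimeo: the first alternative consumes it without capturing anything, so only group 1 holds the lines you want to keep.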

Match information

MATCH 1
1.  [0-155] `<p><iframe src="http://www.3you3tube.com/embed/YoX1yc92MOU" width="500" height="300" frameborder="0" scrolling="auto"></iframe></p>`
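
To make the role of the capture group concrete, here is a small Ruby sketch, assuming the pattern is applied one line at a time. The names ALTERNATION and unignored_line are hypothetical, and the sample strings are shortened from the question.

```ruby
# Pattern copied from this answer: the first alternative swallows lines whose
# iframe mentions one of the three hosts; the second captures any other iframe line.
ALTERNATION = /.*?iframe src=".*?(?:youtube|soundcloud|vimeo).*?".*|(.*?iframe src=".*?".*)/

# Returns the captured line for iframes that should be kept, or nil when the
# line was consumed by the first alternative (or holds no iframe at all).
def unignored_line(text)
  match = ALTERNATION.match(text)
  match && match[1]
end

other   = '<p><iframe src="http://www.3you3tube.com/embed/YoX1yc92MOU" width="500"></iframe></p>'
youtube = '<p><iframe src="http://www.youtube.com/embed/YoX1yc92MOU" width="500"></iframe></p>'

puts unignored_line(other).inspect    # the full input line
puts unignored_line(youtube).inspect  # nil
```

Note that the .*? after src=" is not limited to the attribute value, so a line is also ignored when one of the three names appears later on the same line, followed by a double quote.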
Federico Piazza

HTML is notoriously difficult to parse using regular expressions unless you own the generation of that HTML, and even then it's a pain.

Instead, for anything beyond the most trivial use, go for a parser, which can normalize away a lot of the problems that make a pattern fail.

The patterns submitted will fail because they assume tag-name case, whitespace and string delimiters for the src parameter. Those could be accommodated in the pattern, but it's easier to not bother. In the following code, all the strings being checked are valid HTML:

require 'htmlentities'
require 'nokogiri'

[
  %#<p><iframe\nsrc="http://www.youtube.com/embed/YoX1yc92MOU_1"></iframe></p>#,
  %#<p><iframe\nsrc= "http://www.youtube.com/embed/YoX1yc92MOU_2"></iframe></p>#,
  %#<p><iframe\nsrc = "http://www.youtube.com/embed/YoX1yc92MOU_3"></iframe></p>#,
  %#<p><iframe\nsrc = 'http://www.youtube.com/embed/YoX1yc92MOU_4'></iframe></p>#,
  %#<p><Iframe\nsrc = 'http://www.youtube.com/embed/YoX1yc92MOU_5'></iframe></p>#,
  %#<p><IFRAME\nsrc = 'http://www.youtube.com/embed/YoX1yc92MOU_6'></iframe></p>#,
  %#<p><IFRAME\nsrc =
  'http://www.youtube.com/embed/YoX1yc92MOU_7'></iframe></p>#,
].each do |text|
  html = HTMLEntities::Decoder.new('html4').decode(text)
  doc = Nokogiri::HTML::DocumentFragment.parse(html)

  iframe = doc.at('iframe')
  puts "Ignoring: #{ iframe['src'] }" if iframe['src'][/\b(?:youtube|soundcloud|vimeo)\b/i]
end
# >> Ignoring: http://www.youtube.com/embed/YoX1yc92MOU_1
# >> Ignoring: http://www.youtube.com/embed/YoX1yc92MOU_2
# >> Ignoring: http://www.youtube.com/embed/YoX1yc92MOU_3
# >> Ignoring: http://www.youtube.com/embed/YoX1yc92MOU_4
# >> Ignoring: http://www.youtube.com/embed/YoX1yc92MOU_5
# >> Ignoring: http://www.youtube.com/embed/YoX1yc92MOU_6
# >> Ignoring: http://www.youtube.com/embed/YoX1yc92MOU_7

"RegEx match open tags except XHTML self-contained tags" is an obligatory link on Stack Overflow when these sort of questions arise. The most famous answer is tongue-in-cheek of course, but it makes the point not to do this with patterns.

In the code above, /\b(?:youtube|soundcloud|vimeo)\b/i is a regular expression, but it is short and sweet and isn't applied to the HTML at all. Instead, it's used against the content of the src parameter, which has to be correct in the (encoded) HTML and can't be mangled/munged, otherwise the iframe itself would not work.
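
To illustrate what the word boundaries in that short pattern do and do not catch, here is a minimal sketch that runs it against a few src values on their own, with no HTML involved. The names EMBED_HOSTS and ignored_src? are hypothetical, and every URL other than the two taken from the question is invented for the example.

```ruby
# The same pattern used in the code above, applied only to src values.
EMBED_HOSTS = /\b(?:youtube|soundcloud|vimeo)\b/i

# True when the src mentions one of the three hosts as a whole word.
def ignored_src?(src)
  src.match?(EMBED_HOSTS)
end

[
  'http://www.youtube.com/embed/YoX1yc92MOU',    # from the question
  'http://www.3you3tube.com/embed/YoX1yc92MOU',  # from the question
  'https://player.vimeo.com/video/1',            # made-up URL
  'http://notyoutube.example/video',             # made-up URL
].each do |src|
  puts "#{ignored_src?(src) ? 'Ignoring' : 'Keeping'}: #{src}"
end
```

The \b anchors keep a name like notyoutube from matching, but the check still looks at the whole src string, so a URL that merely mentions one of the names as a separate word in its path or query would also be ignored. Comparing against the parsed host name would be stricter, if that matters for your data.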

Community
the Tin Man