Suppose I have the following HTML elements:

<iframe height="100" width="200" src="https://www.stackoverflow.com/share"></iframe>


<iframe height="100" width="200" src="https://www.google.com/share"></iframe>


<iframe height="100" width="200" src="https://www.yahoo.com/share"></iframe>

I want to use regex to find the iframe with a specific src (it has to contain https://www.stackoverflow.com/share/{s}) and get the other attributes of that HTML element.

So in this instance, the regex would return:

Group 1: https://www.google.com/share
Group 2: 100
Group 3: 200

I have tried the following:

iframe.*src[^""]+['"]+(https:\/\/www.google.com\/share)

This finds the specific URL and returns it in a group, no matter where it is within the string.

The issue I'm facing is expanding on this to return all of the other attributes within the HTML element.

I have tried adding the following to the regex:

\s+width="(.*?)"\s+height="(.*?)"

But this returns no match.

How can I extend my current regex, if that is possible, to capture the remaining attribute values?

My regex 101 file

Toto
Phorce
  • Regex is not the best choice for parsing HTML. Use a real parser instead, for example HtmlAgilityPack – Flat Eric Mar 17 '18 at 14:19
  • @FlatEric Hey - I have to use regex for this specific project! – Phorce Mar 17 '18 at 14:21
  • [It is too bad you must use regex](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Crowcoder Mar 17 '18 at 14:21
  • Do you have `\K` suppport? You can use `iframe.*src[^""]+['"]+\K(https:\/\/www.google.com\/share)` https://regex101.com/r/YiYXDg/2 – mrzasa Mar 17 '18 at 14:22
  • @mrzasa Finding the `src` is fine, I have achieved that. It's finding the other attributes such as `width` and `height` that is the problem I'm facing – Phorce Mar 17 '18 at 14:23
  • width and height in your sample code are in the opposite order to the one in your regex. Apart from that: I also recommend using HtmlAgilityPack. For your regex to work reliably with all valid HTML, you would have to support different attribute orderings, HTML comments, whitespace, etc. An HTML parser takes care of all that for you. – Heinz Kessler Mar 17 '18 at 14:30
  • Isn't [AngleSharp](https://github.com/AngleSharp/AngleSharp) better than HtmlAgilityPack? Why does everyone still recommend HtmlAgilityPack? – Crowcoder Mar 17 '18 at 14:32
  • Yeah I cannot use anything but regex so do you think it will be impossible to do it using regex? – Phorce Mar 17 '18 at 14:34
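The attribute-order point raised in the comments can be illustrated with a small sketch. Python is used here purely for illustration (the question does not name a language), and the pattern below is not the asker's original: it assumes the attributes always appear in the order height, width, src, as in the sample markup, and it uses `[^"]*` instead of `.*?` so that a value cannot run past its closing quote.

```python
import re

# Assumption: attributes always appear as height, width, src (as in the sample).
# Group 1 = height, group 2 = width, group 3 = src.
PATTERN = re.compile(
    r'<iframe\s+height="([^"]*)"\s+width="([^"]*)"\s+'
    r'src="(https://www\.google\.com/share)"'
)

html = """
<iframe height="100" width="200" src="https://www.stackoverflow.com/share"></iframe>
<iframe height="100" width="200" src="https://www.google.com/share"></iframe>
<iframe height="100" width="200" src="https://www.yahoo.com/share"></iframe>
"""

match = PATTERN.search(html)
if match:
    height, width, src = match.groups()
    print(src, height, width)
```

Appending `\s+width="(.*?)"\s+height="(.*?)"` after the `src` part fails on this markup because nothing but `>` follows the `src` attribute; the size attributes come before it.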

1 Answer

Update

For URLs like https://www.stackoverflow.com/share/xxx use (see https://regex101.com/r/bKbfTt/6):

iframe\s*(?:\s|width="(.*?)"|height="(.*?)")+src="(.*?www\.stackoverflow\.com\/share\/.*?)"

For URLs like https://www.stackoverflow.com/share (without the /xxx part) use (see https://regex101.com/r/bKbfTt/5):

iframe\s*(?:\s|width="(.*?)"|height="(.*?)")+src="(.*?www\.stackoverflow\.com\/share(?:\/.*?)?)"

The tested cases are:

<iframe height="100" width="200" src="https://www.youtube.com/share"></iframe>
<iframe height="100" width="200" src="https://www.youtube.com/share/xxx"></iframe>

<iframe height="100" width="200" src="https://www.stackoverflow.com/share/xxx"></iframe>
<iframe height="100" width="200" src="https://www.stackoverflow.com/share"></iframe>

<iframe height="100" width="200" src="https://www.google.com/share"></iframe>
<iframe height="100" width="200" src="https://www.google.com/share/xxx"></iframe>

<iframe height="100" width="200" src="https://www.yahoo.com/share"></iframe>
<iframe height="100" width="200" src="https://www.yahoo.com/share/xxx"></iframe>
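As a rough check, the second pattern can be run against a few of the tested cases. This is only a sketch in Python (an assumption, since the thread does not say which regex engine the project uses), and the names `PATTERN`, `HTML` and `matches` are made up for the example. Note that group 1 is the width and group 2 the height, regardless of the order in which the attributes appear in the tag.

```python
import re

# Second pattern from the answer, split across two lines for readability.
# Group 1 = width, group 2 = height, group 3 = src.
PATTERN = re.compile(
    r'iframe\s*(?:\s|width="(.*?)"|height="(.*?)")+'
    r'src="(.*?www\.stackoverflow\.com\/share(?:\/.*?)?)"'
)

HTML = """\
<iframe height="100" width="200" src="https://www.youtube.com/share"></iframe>
<iframe height="100" width="200" src="https://www.stackoverflow.com/share/xxx"></iframe>
<iframe height="100" width="200" src="https://www.stackoverflow.com/share"></iframe>
<iframe height="100" width="200" src="https://www.google.com/share"></iframe>
"""

# Each element is a (width, height, src) tuple for one matching iframe.
matches = [m.groups() for m in PATTERN.finditer(HTML)]
for width, height, src in matches:
    print(width, height, src)
```

Only the two stackoverflow.com iframes are matched here. The sketch relies on each iframe being on its own line: `.` does not match a newline by default, so `.*?` cannot run on into the next tag. If several iframes share one line, the pattern may need `[^"]*` in place of `.*?`.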

Previous answer

Check this out: https://regex101.com/r/bKbfTt/1

The regex is iframe\s*(?:\s|width="(.*?)"|height="(.*?)"|src="(.*?)")+ which matched your examples:

<iframe height="100" width="200" src="https://www.youtube.com/share"></iframe>
<iframe height="100" width="200" src="https://www.stackoverflow.com/share"></iframe>
<iframe height="100" width="200" src="https://www.google.com/share"></iframe>
<iframe height="100" width="200" src="https://www.yahoo.com/share"></iframe>
Max Senft
  • thanks but is there a way to match a specific URL? It cannot be applied to all iframes on the page, only ones with a specific URL – Phorce Mar 17 '18 at 14:38
  • I updated the file: https://regex101.com/r/bKbfTt/2 Is this what you want? I'm still not really sure. ;-) – Max Senft Mar 17 '18 at 14:41
  • it shows there are 4 results when there should only be one since it should only match the one where the url matches.. if that makes sense? So if the URL is stackoverflow.com/share it would only match that one and give me the attributes for that case – Phorce Mar 17 '18 at 14:45
  • Well, the problem is that width, height and src could be in any order. Or is the order fixed? Or is src always the last attribute? – Max Senft Mar 17 '18 at 14:47
  • Is it possible to check the string in multiple steps? Like 1. Check if it is an iframe, 2. check for `src="(.*?www\.stackoverflow\.com\/share\/.*?)"`, 3. check for `width="(.*?)"`, 4. check for `height="(.*?)"` – Max Senft Mar 17 '18 at 14:51
  • Yeah this is fine to check because I just want to isolate that an iframe with the specific URL exists and get the attributes. I believe the positioning of the elements will not change – Phorce Mar 17 '18 at 14:54
  • Thank you! I was able to use your solution to figure it out and now it works just fine! Going to do some more testing but accepting your answer :) – Phorce Mar 17 '18 at 15:27
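The multi-step check suggested in the comments above could look roughly like the following. It is a sketch in Python, not the code the asker ended up with. The `SRC`, `WIDTH` and `HEIGHT` patterns are taken from the comment; the `IFRAME_TAG` pattern and the `find_share_iframes` helper are hypothetical additions for step 1, and `[^>]*` assumes that no attribute value contains a `>` character.

```python
import re

# Step 1 (assumed pattern): isolate each opening <iframe ...> tag.
IFRAME_TAG = re.compile(r'<iframe\b[^>]*>')
# Steps 2-4: the per-attribute patterns from the comment.
SRC = re.compile(r'src="(.*?www\.stackoverflow\.com\/share\/.*?)"')
WIDTH = re.compile(r'width="(.*?)"')
HEIGHT = re.compile(r'height="(.*?)"')


def find_share_iframes(html):
    """Return (src, width, height) for each iframe whose src matches SRC."""
    results = []
    for tag in IFRAME_TAG.finditer(html):
        attrs = tag.group(0)
        src = SRC.search(attrs)
        if src is None:
            continue  # not the URL we are looking for
        width = WIDTH.search(attrs)
        height = HEIGHT.search(attrs)
        results.append((
            src.group(1),
            width.group(1) if width else None,
            height.group(1) if height else None,
        ))
    return results


sample = (
    '<iframe height="100" width="200" src="https://www.google.com/share/xxx"></iframe>\n'
    '<iframe width="200" height="100" src="https://www.stackoverflow.com/share/xxx"></iframe>\n'
)
print(find_share_iframes(sample))
```

Because each attribute is searched for separately inside one tag, the order of `width`, `height` and `src` no longer matters, which sidesteps the ordering problem discussed earlier in the thread.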