I am trying to use a regex to find out whether a certain tag appears in the source code of a website.

The tag always starts with "<link" and always ends with ">". For a match to "succeed", all of the following strings must appear between these two delimiters:

  • 'rel="alternate"'
  • 'media='
  • 'max-width:'
  • '640px'
  • 'only screen'
  • 'href='

Within the source code of a website this should match for example the following tag:

The problem is that the attributes within this tag can appear in a different order, for example:

< link media="only screen and (max-width: 640px)" rel="alternate" href="http://m.example.com/page-1">

or:

< link href="http://m.example.com/page-1" media="only screen and (max-width: 640px)" rel="alternate"/>

My problem is that a regex like

(<link ).*(rel=).*(media=).*(640px).*(href=).*(>)

requires rel, media, 640px and href to appear in exactly this order, but in the real source they can appear in any order.
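
The order dependence can be reproduced with a small Python sketch. The two tags below are illustrative strings assembled from the attributes listed above (written without a space after "<" so that the literal "<link " in the pattern can match); they are assumptions for demonstration, not tags taken from a real page.

```python
import re

# Order-dependent pattern from the question: each ".*" only moves forward,
# so rel=, media=, 640px and href= must appear in exactly this sequence.
ORDERED_RE = re.compile(r'(<link ).*(rel=).*(media=).*(640px).*(href=).*(>)')

# Illustrative tags (assumed for this sketch).
rel_first = (
    '<link rel="alternate" '
    'media="only screen and (max-width: 640px)" '
    'href="http://m.example.com/page-1">'
)
media_first = (
    '<link media="only screen and (max-width: 640px)" '
    'rel="alternate" '
    'href="http://m.example.com/page-1">'
)

# The first tag follows the hard-coded order and matches; the second one
# has media= before rel=, so the pattern finds no match.
print(ORDERED_RE.search(rel_first) is not None)
print(ORDERED_RE.search(media_first) is not None)
```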

What I did so far:

  • searched through Stack Overflow (this was probably asked before, but I couldn't find the solution; the earlier asker probably used very different wording to describe it)
  • tried building the regex on https://regex101.com (this is how I came up with what I have so far)

Can anyone push me in the right direction please? Thank you in advance to everyone!

JMW
Shopsi
  • I would recommend using an HTML parser, or checking whether any of the suggestions here help: https://stackoverflow.com/a/7564061/1535270 – abestrad Jul 02 '20 at 21:17
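
As a rough illustration of the parser route suggested in the comment above, here is a minimal sketch using Python's standard-library html.parser. The class name MobileAlternateFinder and the helper has_mobile_alternate_link are hypothetical names chosen for this example, and the checks simply mirror the strings listed in the question. Note that a parser only recognises "<link" written without a space after "<".

```python
from html.parser import HTMLParser


class MobileAlternateFinder(HTMLParser):
    """Hypothetical helper: records whether a matching <link> tag was seen."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <link ... />.
        if tag != "link":
            return
        attributes = dict(attrs)  # attribute order no longer matters
        rel = (attributes.get("rel") or "").lower().split()
        media = (attributes.get("media") or "").lower()
        if (
            "alternate" in rel
            and attributes.get("href")
            and "only screen" in media
            and "max-width:" in media
            and "640px" in media
        ):
            self.found = True


def has_mobile_alternate_link(html):
    finder = MobileAlternateFinder()
    finder.feed(html)
    finder.close()
    return finder.found


page = (
    '<html><head>'
    '<link href="http://m.example.com/page-1" '
    'media="only screen and (max-width: 640px)" rel="alternate"/>'
    '</head></html>'
)
print(has_mobile_alternate_link(page))
```

Because the attributes end up in a dictionary, the check is independent of the order in which they are written in the tag.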

1 Answer

You can use the following regular expression to identify clauses beginning with a match of < *link and ending with a match of *> that satisfy the stated requirements.

< *link(?=[^>\n]*\brel="alternate")(?=[^>\n]*\bmedia=)(?=[^>\n]+\bmax-width:)(?=[^>\n]+\b640px\b)(?=[^>\n]+\bonly[ -]screen\b)(?=[^>\n]*\bhref=)[^>\n]+ *>

Python's regex engine performs the following operations.

< *link             : match '<', 0+ spaces, 'link'
(?=                 : begin positive lookahead
  [^>\n]*           : match 0+ characters other than
                      '>' and line terminator
  \brel="alternate" : match 'rel="alternate"' with a leading
                      word boundary
)                   : end positive lookahead
(?=[^>\n]*\bmedia=)           : similar to above
(?=[^>\n]+\bmax-width:)       : similar to above
(?=[^>\n]+\b640px\b)          : similar to above
(?=[^>\n]+\bonly[ -]screen\b) : similar to above
(?=[^>\n]*\bhref=)            : similar to above
[^>\n]+ *>          : match 1+ characters other than '>'
                      and line terminator, followed by
                      0+ spaces, '>'

The six positive lookaheads can be in any order.
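
A minimal Python sketch of how the expression above might be used follows. The helper build_pattern and the names LOOKAHEADS, LINK_RE and REVERSED_RE are hypothetical, introduced only for illustration; the two matching tags are the ones from the question, and the third tag is an assumed counterexample. Building the pattern from a list also makes it easy to check that the lookaheads really can be reordered.

```python
import re

# Lookaheads copied from the expression above. Each one checks a single
# requirement without consuming characters, so their order is irrelevant.
LOOKAHEADS = [
    r'(?=[^>\n]*\brel="alternate")',
    r'(?=[^>\n]*\bmedia=)',
    r'(?=[^>\n]+\bmax-width:)',
    r'(?=[^>\n]+\b640px\b)',
    r'(?=[^>\n]+\bonly[ -]screen\b)',
    r'(?=[^>\n]*\bhref=)',
]


def build_pattern(lookaheads):
    """Join '< *link', the given lookaheads and the closing part."""
    return re.compile(r'< *link' + ''.join(lookaheads) + r'[^>\n]+ *>')


LINK_RE = build_pattern(LOOKAHEADS)
REVERSED_RE = build_pattern(list(reversed(LOOKAHEADS)))

samples = [
    '< link media="only screen and (max-width: 640px)" rel="alternate" '
    'href="http://m.example.com/page-1">',
    '< link href="http://m.example.com/page-1" '
    'media="only screen and (max-width: 640px)" rel="alternate"/>',
    '<link rel="stylesheet" href="style.css">',  # assumed counterexample
]

for tag in samples:
    # Both patterns should agree on every sample.
    print(LINK_RE.search(tag) is not None, REVERSED_RE.search(tag) is not None)
```

The patterns are written as raw strings (r'...') so that \b and \n reach the regex engine unchanged.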

Cary Swoveland
  • I was so long struggling to find exactly what you are suggesting. Thank you sooo much!!! Will play with it tonight :-) – Shopsi Jul 03 '20 at 06:11
  • Couldn't get it completely running yet, but I managed to make a few changes that get most of it working in the online testing tool on regex101.com: (< *link *)(?=[^>\n]*\b(rel=\"alternate\"))(?=[^>\n]*\b(media=\"))(?=[^>\n]*\b(max-width))(?=[^>\n]*\b(640px))(?=[^>\n]*\b(href=\"))(?=[^>\n])(?=[^>\n]*)([^>\n]+ *>) Only in my real Python code it still reports a syntax error (invalid syntax) right at the first < character of the regex. – Shopsi Jul 03 '20 at 15:23
  • I will keep on trying by myself - just wanted to give feedback on how it is going so far. Cary has definitely helped me a lot to get where I am now. Thanks again. I will post the final solution once I have it :-) – Shopsi Jul 03 '20 at 15:27
  • Try entering just `(< link *)([^>\n]+ *>)` to confirm that it works. Then test each lookahead separately, e.g. `(< link *)(?=[^>\n]*\b(max-width))([^>\n]+ *>)`. btw, I don't see the value of the capture groups. They just tell you, for example, that the capture group `(max-width)` contains `"max-width"` if there's a match. I do see why you might be interested in capturing `"640"` in `(max-width: 640px)`, which is doable. – Cary Swoveland Jul 03 '20 at 16:45
  • You might find it convenient to use the "Test String" section at regex101.com for saving parts or all of the regex you are testing. – Cary Swoveland Jul 03 '20 at 17:06
  • The Code Generator section helped. It seems adding r" to the regex helped, although the comments state that this is only necessary for Python 2.x and I am sure I am running Python 3. The full working line is regex = r"(< *link *)(?=[^>\n]*\b(rel=\"alternate\"))(?=[^>\n]*\b(media=\"))(?=[^>\n]*\b(max-width))(?=[^>\n]*\b(640px))(?=[^>\n]*\b(href=\"))(?=[^>\n])(?=[^>\n]*)([^>\n]+ *>)" – Shopsi Jul 04 '20 at 04:34
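
A note on the r prefix from the last comment: raw strings still matter in Python 3. In an ordinary string literal "\b" is a backspace character, whereas in a raw string it stays a backslash followed by "b", which the regex engine reads as a word boundary. The short sketch below illustrates the difference; the tag is an assumed example. The earlier "invalid syntax" error at the first < was probably a separate issue, most likely the pattern not being inside a string literal at all, but that is a guess since the failing code is not shown.

```python
import re

tag = '<link rel="alternate" href="http://m.example.com/page-1">'  # assumed sample

plain = "\bhref="   # "\b" is a single backspace character (0x08) here
raw = r"\bhref="    # backslash + "b": a regex word boundary

# The plain string is one character shorter because "\b" collapsed.
print(len(plain), len(raw))

# The plain pattern looks for a literal backspace before "href=" and fails;
# the raw pattern finds "href=" at a word boundary.
print(re.search(plain, tag) is not None)
print(re.search(raw, tag) is not None)
```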