-1

I need to match a single word (blah) inside the innermost quotes. Example:

<link rel="stylesheet" type="text/css" href="/BLAH/Test/Test/Test.css"> <script src="/blah/Test/Test/Test.js"></script> 

So I need it to return:

"/BLAH/Test/Test/Test.css"

"/blah/Test/Test/Test.js"

When I try to write a pattern, it matches from the first double quote to the last one instead of finding the two separate quoted strings that contain the word blah.

Any help would be appreciated, but more than that, please explain so I can learn!

Casimir et Hippolyte
jcaruso

3 Answers

1

(<link.*href=['"]([^'"]*).*|<script.*src=['"]([^'"]*).*)

You can see it in action here

The pattern first locates a link tag or a script tag. Then it looks for the href attribute in a link, or the src attribute in a script, and captures whatever is inside the quotes of that attribute. It will not work if the attribute values are not quoted.
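A minimal sketch of how this pattern might be used, assuming Python (the answer does not name a language). The helper name `extract_urls` and the sample input are hypothetical. Each tag is placed on its own line because the trailing `.*` runs to the end of the line, so a second tag on the same line would be swallowed by the first match.

```python
import re

# Pattern from the answer: group 2 holds a link's href, group 3 a script's src.
PATTERN = re.compile(r"""(<link.*href=['"]([^'"]*).*|<script.*src=['"]([^'"]*).*)""")


def extract_urls(html):
    """Return the href/src values found by PATTERN (hypothetical helper)."""
    urls = []
    for match in PATTERN.finditer(html):
        # Only one of the two inner groups participates in any given match.
        urls.append(match.group(2) or match.group(3))
    return urls


# Assumed sample input, one tag per line.
sample = (
    '<link rel="stylesheet" type="text/css" href="/BLAH/Test/Test/Test.css">\n'
    '<script src="/blah/Test/Test/Test.js"></script>\n'
)

print(extract_urls(sample))
```

Note that this pattern returns every href or src value, whether or not it contains blah; filtering on the word would need an extra check.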

You could also use the lookbehind method that I suggested in the comments, but I am told that lookbehinds are not widely supported, so use them at your own risk.

emsimpson92
1

You need a regex that greedily matches inside an opening tag and then backtracks until it finds an attribute value containing blah. You should also enable the case-insensitivity flag i, or go with [bB][lL][aA][hH]:

<\w+ [^>]*\w+="([^"]*?blah[^"]*)"

Live demo

Regex breakdown:

  • <\w+ Match a tag opening
  • [^>]* Match anything except >, zero or more times
  • \w+=" Match an attribute name followed by ="
  • ( Start of capturing group #1
    • [^"]*?blah[^"]* Match anything inside the double quotes that contains the word blah
  • ) End of capturing group #1
  • " Match "

Then you need to access the first capturing group. In a language like PHP, this would be:

$str = <<<_
<link rel="stylesheet" type="text/css" href="/BLAH/Test/Test/Test.css">
<script src="/blah/Test/Test/Test.js"></script> 
_;

preg_match_all('~<\w+ [^>]*\w+="([^"]*?blah[^"]*)"~i', $str, $matches);
var_dump($matches[1]); // Here we dump captured group one
revo
  • The only problem with this is that it requires the html to use `" "` for attribute fields while `source="somesrc"`, `source='somesrc'`, and `source=somesrc` are all valid options. – emsimpson92 Jun 22 '18 at 21:24
  • I believe OP wanted to match the quoted string that contains `blah`, not the last quoted string regardless of content – Aprillion Jun 22 '18 at 21:26
  • @emsimpson92 OP's looking for double quotes as they stated, but if they need this I'll edit. – revo Jun 22 '18 at 21:29
  • I know they are, but it's always a good idea to throw out that little disclaimer so they aren't blindsided if they run it on someone else's html. – emsimpson92 Jun 22 '18 at 21:31
  • @Aprillion Missed that. Fixed. Thank you. – revo Jun 22 '18 at 21:35
  • @emsimpson92 I could guess a lot but I wouldn't do that. I appreciate your comment. OP will probably read comments here. Let's see what they need. – revo Jun 22 '18 at 21:37
0

If you use "(.*)" to match e.g. a="aa" b="bb", you get aa" b="bb, because * is a greedy quantifier. See e.g. What do 'lazy' and 'greedy' mean in the context of regular expressions?

You can use a lazy quantifier, e.g. *? in "(.*?)", or a greedy quantifier with an expression that matches everything except a quote, e.g. "([^"]*)"
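The difference between the three patterns can be checked with a short sketch. Python is an assumption here (the answer is language-agnostic), and the variable names are only illustrative.

```python
import re

text = 'a="aa" b="bb"'

# Greedy: .* runs to the last quote, so one match spans both attributes.
greedy = re.findall(r'"(.*)"', text)

# Lazy: .*? stops at the first closing quote it can reach.
lazy = re.findall(r'"(.*?)"', text)

# Negated class: [^"]* cannot cross a quote at all.
negated = re.findall(r'"([^"]*)"', text)

print(greedy)   # ['aa" b="bb']
print(lazy)     # ['aa', 'bb']
print(negated)  # ['aa', 'bb']
```

The negated class is usually the safer choice of the two fixes, since it cannot be pushed past a quote by backtracking the way a lazy quantifier can when more pattern follows it.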

Aprillion
  • To add to this, you can use lookbehinds to select the specific attributes you're looking for. [As you can see here...](https://regex101.com/r/xb3EG6/7) – emsimpson92 Jun 22 '18 at 21:38