I am trying to parse an HTML fragment to find out whether there is any whitespace inside the 'href' or 'src' attributes of its tags. So far I have come up with this regular expression:

(src|href)(\s*=\s*)(["'])(.+(?=\s).+)\3

But it gives a false positive if there is whitespace after the closing quote, which makes it fairly useless. How can it be modified?

Example: https://regex101.com/r/JXp6pZ/1
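
For anyone who wants to reproduce the false positive locally, here is a minimal Python sketch. The sample tag is a made-up example (not taken from the regex101 link), and Python's `re` module is only one possible engine. It shows the greedy `.+` running past the closing quote of `href` and stopping at the closing quote of a later attribute:

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = re.compile(r"""(src|href)(\s*=\s*)(["'])(.+(?=\s).+)\3""")

# Hypothetical sample: the href value contains no whitespace,
# but a second quoted attribute follows it.
clean_tag = '<a href="http://example.com/page" class="x">'

match = ORIGINAL.search(clean_tag)

# Group 4 is what the pattern believes the attribute value to be.
captured = match.group(4) if match else None
print(captured)
```

The captured "value" spans two attributes, because nothing in `.+` stops it from consuming quote characters.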

user2816626
  • 2
    [H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) - use a parser. [`(?:src|href)\s*=\s*(["']?)\S*\1`](https://regex101.com/r/JXp6pZ/2) should work for you though – ctwheels Jan 24 '18 at 15:45
  • Do you mean something like this: https://regex101.com/r/JXp6pZ/3 – Srdjan M. Jan 24 '18 at 15:51
  • 1
    Thanks for your responses, but that does the opposite of what is needed. It returns a match if there is no whitespace, but I need a match only if there is whitespace somewhere in src or href. – user2816626 Jan 24 '18 at 15:56
  • 1
    So you want [`\b(?:src|href)\s*=\s*(?:'[^']*\s[^']*'|"[^"]*\s[^"]*")`](https://regex101.com/r/JXp6pZ/6)? – ctwheels Jan 24 '18 at 16:00
  • Yes, that works nicely, but I wonder if it can be any simpler? – user2816626 Jan 24 '18 at 16:10
  • As pointed out above, do not use a regex. Use a parser. HTML is not regular; **regular** expressions are not a suitable tool for the job. – Tom Lord Jan 24 '18 at 16:16
  • 1
    `(?:src|href)=.[^'"]*\s` https://regex101.com/r/VVwW5y/1 – Scott Weaver Jan 24 '18 at 16:17
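
Several comments recommend a parser instead of a regex. As a rough sketch of what that could look like, assuming Python and its standard-library `html.parser` (the question does not say which language is in use), the class and function names below are made up for illustration:

```python
from html.parser import HTMLParser


class WhitespaceAttrFinder(HTMLParser):
    """Collects src/href attributes whose value contains whitespace."""

    def __init__(self):
        super().__init__()
        self.hits = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names are lowercased
        # and surrounding quotes are already stripped by the parser.
        for name, value in attrs:
            if name in ("src", "href") and value is not None:
                if any(ch.isspace() for ch in value):
                    self.hits.append((tag, name, value))


def find_whitespace_urls(fragment):
    """Return (tag, attribute, value) triples for offending attributes."""
    finder = WhitespaceAttrFinder()
    finder.feed(fragment)
    finder.close()
    return finder.hits


print(find_whitespace_urls('<a href="a b.html" class="x y">link</a>'))
print(find_whitespace_urls("<img src='ok.png' >"))
```

Because the parser hands over each attribute value separately, whitespace after the closing quote or inside other attributes such as `class` cannot leak into the check.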

1 Answer

You could try this pattern:

(?:src|href)=.[^'"]*\s[^'"]*['"]

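*We use . for the opening quote because we don't care whether it is a single or a double quote, which makes the pattern slightly simpler to read. Note that . matches any character, so the pattern does not check that a quote is actually there.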
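
To see how this pattern behaves next to the longer one suggested in the comments, here is a small Python comparison. The sample fragments are invented for illustration, and the helper name `check` is hypothetical:

```python
import re

# Pattern from this answer.
ANSWER = re.compile(r"""(?:src|href)=.[^'"]*\s[^'"]*['"]""")

# Longer pattern from the comments on the question.
COMMENT = re.compile(
    r"""\b(?:src|href)\s*=\s*(?:'[^']*\s[^']*'|"[^"]*\s[^"]*")"""
)


def check(fragment):
    """Return (answer_matches, comment_matches) for one fragment."""
    return (
        ANSWER.search(fragment) is not None,
        COMMENT.search(fragment) is not None,
    )


samples = [
    '<a href="a b.html">',             # whitespace inside the value
    '<a href="page.html" class="x">',  # clean value, another attribute follows
    "<img src = 'my pic.png'>",        # spaces around the equals sign
    '<a href=page.html class="x">',    # unquoted value
]

for fragment in samples:
    print(fragment, check(fragment))
```

On these samples the two patterns agree on the first two fragments. The shorter pattern misses the third, because it expects `=` directly after the attribute name, and it reports the unquoted fourth one even though its value has no whitespace. Whether that trade-off is acceptable depends on how regular the input HTML is.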

Scott Weaver