Could you please correct my regex?

I need to match all <img> tags whose src contains ?contextId. For instance, the following string should be matched:

<img xmlns="http://www.w3.org/1999/xhtml" src="http://10.3.34.34:8080/Bilder/pic.png?contextId=qualifier123" alt="Bild" />

I wrote the regular expression and it does what I need:

(?i)<img[^>]+? src\s*?=\s*?"(.*?\?contextId.*?)"[^\/]+?\/>

But it seems to me that it takes too many steps to match (380 in the regex demo).

The input string can be up to 30,000 characters long, and I worry that the Java regex engine may fail on my non-optimized expression.

Biffen
Aleks
  • What is the desired return? Do you want the whole `img` tag? Just the `src`? Something else? – sco1 Apr 08 '16 at 12:44
  • [Don't parse HTML with regex!](http://stackoverflow.com/a/1732454/418066) – Biffen Apr 08 '16 at 12:45
  • Your regex is incorrect; the third example [here](https://regex101.com/r/iR2sQ0/1) is matched and should not be. Here's a solution (that also takes fewer steps because I removed some lazy modifiers): https://regex101.com/r/bF8eI9/1. Anyway, I think you should follow Biffen's advice above. – Cristian Lupascu Apr 08 '16 at 12:49
  • I need both the whole `<img>` tag and the link in `src`. – Aleks Apr 08 '16 at 12:52
  • Biffen, thank you! Should I use jsoup for parsing? – Aleks Apr 08 '16 at 13:12
  • Wolf, thank you! But the regex you sent in your link is the same as mine. Could you please check it? Maybe it was not updated or saved? – Aleks Apr 08 '16 at 13:17
  • @Aleks jsoup is one option, but there are others. Pick your favourite. – Biffen Apr 08 '16 at 13:20

2 Answers

98 steps (regex demo):

<img.*?src="[^"]+\?contextId[^>]+>

This regex assumes that the HTML is not malformed and, in particular, that each img tag has a src attribute.
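
To see why that assumption matters: `.*?` can also consume `>`, so when an img tag has no src the match can start there and run on into a later tag. A minimal Java sketch of this behaviour follows; the class name `CrossTagDemo` and the sample input are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CrossTagDemo {
    // The pattern from this answer, unchanged (only escaped for a Java string).
    static final Pattern IMG = Pattern.compile("<img.*?src=\"[^\"]+\\?contextId[^>]+>");

    // Returns the first match in html, or null when nothing matches.
    static String firstMatch(String html) {
        Matcher m = IMG.matcher(html);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Hypothetical input: the first <img> has no src attribute.
        String html = "<img alt=\"a\" /><img src=\"b.png?contextId=1\" alt=\"b\" />";
        // The match begins at the first <img and spans both tags.
        System.out.println(firstMatch(html));
    }
}
```

With input like this, the reported "tag" is really two tags glued together, so the pattern is only safe when every img is known to carry a src.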

EDIT: 104 steps to take both the img and the src link (regex demo):

(<img.*?src="([^"]+\?contextId[^"]+)"[^>]+>)
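
Since the question is about Java, here is a rough sketch of how the two groups could be read with `java.util.regex`. The class and method names are hypothetical; the pattern is the one above, compiled once so it can be reused across inputs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgContextIdFinder {
    // The two-group pattern from the EDIT above, compiled once and reused.
    private static final Pattern IMG =
        Pattern.compile("(<img.*?src=\"([^\"]+\\?contextId[^\"]+)\"[^>]+>)");

    // Returns {whole tag, src value} pairs for every match in html.
    static List<String[]> find(String html) {
        List<String[]> hits = new ArrayList<>();
        Matcher m = IMG.matcher(html);
        while (m.find()) {
            hits.add(new String[] { m.group(1), m.group(2) });
        }
        return hits;
    }

    public static void main(String[] args) {
        // The sample tag from the question, wrapped in a paragraph.
        String html = "<p><img xmlns=\"http://www.w3.org/1999/xhtml\" "
            + "src=\"http://10.3.34.34:8080/Bilder/pic.png?contextId=qualifier123\" "
            + "alt=\"Bild\" /></p>";
        for (String[] hit : find(html)) {
            System.out.println("tag: " + hit[0]);
            System.out.println("src: " + hit[1]);
        }
    }
}
```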
dron22

I made some changes to your regex:

<img.*?src\s*=\s*"([^"]*\?contextId[^"]*)

1) .*? inside the quotes changed to [^"]*   # match non-" (double quote) characters instead of . (dot)
2) "[^\/]+?\/> removed                      # no need to match this part

REGEX 101 DEMO
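
If the whole tag is needed as well, one possible variant (my own assumption, not the regex above) replaces `.*?` with `[^>]*?` so the match cannot run past the end of a tag, and puts the closing part back as `[^>]*>`. The Java sketch below uses hypothetical class and method names. It still assumes double-quoted attribute values with no `>` inside them, so it is no substitute for a real HTML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TightImgPattern {
    // Assumed variant: [^>]*? keeps the search for src inside a single tag.
    // Group 1 is the src value; group 0 is the whole tag.
    static final Pattern IMG = Pattern.compile(
        "<img\\b[^>]*?\\ssrc\\s*=\\s*\"([^\"]*\\?contextId[^\"]*)\"[^>]*>",
        Pattern.CASE_INSENSITIVE);

    // Returns {whole tag, src value} for the first match, or null if none.
    static String[] first(String html) {
        Matcher m = IMG.matcher(html);
        return m.find() ? new String[] { m.group(), m.group(1) } : null;
    }

    public static void main(String[] args) {
        // Hypothetical input: the first <img> has no src attribute.
        String html = "<img alt=\"a\" /><img src=\"b.png?contextId=1\" alt=\"b\" />";
        String[] hit = first(html);
        System.out.println("tag: " + hit[0]);
        System.out.println("src: " + hit[1]);
    }
}
```

Because every quantifier is bounded by a negated character class, a failed attempt gives up at the next `>` or `"` instead of scanning to the end of the line.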

Quinn