20

I am trying to write a pattern for extracting the path for files found in img tags in HTML.

String string = "<img src=\"file:/C:/Documents and Settings/elundqvist/My Documents/My Pictures/import dialog step 1.JPG\" border=\"0\" />";

My Pattern:

src\\s*=\\s*\"(.+)\"

The problem is that my pattern also captures the border="0" part of the img tag.

What pattern would match the URI path for this file without including the border="0" part?

Brad Mace
willcodejavaforfood

7 Answers

48

Your pattern should be (unescaped):

src\s*=\s*"(.+?)"

The important part is the added question mark, which makes the quantifier lazy: the group matches as few characters as possible and stops at the first closing quote.
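To make the difference concrete, here is a minimal Java sketch that runs the greedy and the lazy pattern against the string from the question. The class name `LazySrcDemo` and the helper `firstGroup` are made up for illustration; only `java.util.regex` is used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazySrcDemo {
    // Returns capture group 1 of the first match, or null if the pattern does not match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<img src=\"file:/C:/Documents and Settings/elundqvist/My Documents/"
                + "My Pictures/import dialog step 1.JPG\" border=\"0\" />";

        // Greedy: (.+) runs on to the last double quote on the line,
        // so the capture also contains the border attribute.
        System.out.println(firstGroup("src\\s*=\\s*\"(.+)\"", html));

        // Lazy: (.+?) stops at the first closing quote after src=".
        System.out.println(firstGroup("src\\s*=\\s*\"(.+?)\"", html));
    }
}
```

The first call should print the path followed by `" border="0`, the second only the path.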

Sebastian Dietz
18

This one grabs the src only if it's inside an img tag, not when it appears elsewhere as plain text. It also allows for other attributes before or after the src attribute.

It also accepts either single (') or double (") quotes around the value.

\<img.+src\=(?:\"|\')(.+?)(?:\"|\')(?:.+?)\>

So for PHP you would do:

preg_match("/\<img.+src\=(?:\"|\')(.+?)(?:\"|\')(?:.+?)\>/", $string, $matches);
echo "$matches[1]";

for JavaScript you would do:

var match = text.match(/\<img.+src\=(?:\"|\')(.+?)(?:\"|\')(?:.+?)\>/)
alert(match[1]);

Hopefully that helps.

Alfonse
  • 2
    This regex will NOT work if the `src` attribute comes last and its closing quote is immediately followed by the closing `>` of the img tag. – Muhammad Reda Dec 14 '21 at 19:55
  • To make this regex work for such tags, replace `+` with `*` in the last group, so that `(?:.+?)` becomes `(?:.*?)`. – kernelpicnic Apr 25 '23 at 08:40
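Since the question is about Java, here is a rough Java port of the pattern from this answer, with the last group relaxed to `(?:.*?)` as the comments suggest. The class name `ImgTagSrc` and the method `srcOf` are hypothetical, and the sketch assumes a single img tag per input string, because the greedy `.+` after `<img` would otherwise run on to the last `src=` on the line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgTagSrc {
    // Port of the answer's pattern; the trailing group uses * instead of +
    // so that a src attribute directly before the closing > is accepted too.
    private static final Pattern IMG_SRC =
            Pattern.compile("<img.+src=(?:\"|')(.+?)(?:\"|')(?:.*?)>");

    // Returns the src value of the img tag in the input, or null if none is found.
    static String srcOf(String html) {
        Matcher m = IMG_SRC.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(srcOf("<img src=\"a.png\">"));
        System.out.println(srcOf("<img alt='x' src='b.png' border=\"0\">"));
        System.out.println(srcOf("plain text with src=\"c.png\" in it"));
    }
}
```

The third call should return null, since the text contains no img tag.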
10

Try this expression:

src\s*=\s*"([^"]+)"
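As a sketch of how this expression could be used from Java: the negated character class `[^"]+` cannot cross a double quote, so each match ends at the first closing quote without needing a lazy quantifier. The class name `NegatedClassDemo` and the method `allSrc` are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NegatedClassDemo {
    // [^"]+ matches one or more characters that are not a double quote.
    private static final Pattern SRC = Pattern.compile("src\\s*=\\s*\"([^\"]+)\"");

    // Collects every double-quoted src value found in the input.
    static List<String> allSrc(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = SRC.matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<img src=\"a.png\" border=\"0\"><img src = \"b c.jpg\">";
        System.out.println(allSrc(html));
    }
}
```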
mjk
Gumbo
7

I solved it by using this regex.

/<img.*?src="(.*?)"/g

Validated in https://regex101.com/r/aVBUOo/1

Naveen Murthy
0

I'd like to expand on this topic, since the src attribute often comes unquoted. A regex that accepts both quoted and unquoted src attributes is:
src\s*=\s*"?(.+?)["\s]
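A small Java sketch of this idea, using the character class `["\s]` as the terminator (a `|` inside square brackets would be matched as a literal pipe character, so it is left out here). The class name `UnquotedSrcDemo` and the method `src` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnquotedSrcDemo {
    // Optional opening quote, lazy capture, then a closing quote or any whitespace.
    private static final Pattern SRC = Pattern.compile("src\\s*=\\s*\"?(.+?)[\"\\s]");

    // Returns the first src value found, or null if there is none.
    static String src(String html) {
        Matcher m = SRC.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(src("<img src=a.png border=0>"));
        System.out.println(src("<img src=\"b.png\" border=\"0\">"));
        // Whitespace also ends the capture, so a quoted path containing spaces is cut short.
        System.out.println(src("<img src=\"file:/C:/Documents and Settings/x.JPG\">"));
    }
}
```

Because whitespace terminates the capture, this variant only returns `file:/C:/Documents` for a quoted path with spaces like the one in the question, so it fits best when the values are known to contain no spaces.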

Brlja
0

You want the reluctant (non-greedy) form of the quantifier inside the capture group. Something like

src\\s*=\\s*\"(.+?)\"

By default the regex will try to match as much as possible, so the greedy (.+) runs on to the last quote on the line.

oxbow_lakes
0

> I am trying to write a pattern for extracting the path for files found in img tags in HTML.

Can we have an autoresponder for "Don't use regex to parse [X]HTML"?

> Problem is that my pattern will also include the border="0" part of the img tag.

Not to mention any time 'src="' appears in plain text!

If you know in advance the exact format of the HTML you're going to be parsing (eg. because you generated it yourself), you can get away with it. But otherwise, regex is entirely the wrong tool for the job.
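For comparison, here is a minimal sketch of the parser-based route in Java, assuming the HTML parser that ships with the JDK in `javax.swing.text.html` is good enough for the input (it targets old HTML versions, so a dedicated HTML parsing library may be a better fit for real-world pages). The class name `ImgSrcParser` and the method `imgSources` are made up for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class ImgSrcParser {
    // Collects the src attribute of every img tag the parser reports.
    static List<String> imgSources(String html) {
        List<String> sources = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // img has no closing tag, so it is reported as a "simple" tag.
                if (tag == HTML.Tag.IMG) {
                    Object src = attrs.getAttribute(HTML.Attribute.SRC);
                    if (src != null) {
                        sources.add(src.toString());
                    }
                }
            }
        };
        try {
            // The third argument tells the parser to ignore charset declarations in the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sources;
    }

    public static void main(String[] args) {
        String html = "<p>src=\"not-an-image\"</p><img src=\"a b.png\" border=\"0\">";
        System.out.println(imgSources(html));
    }
}
```

Here the `src="..."` text inside the paragraph is handled as ordinary text by the parser, so only attributes of actual img tags are collected.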

bobince