8

You may object that parsing HTML with a regex is a thoroughly bad idea (see this, for example), and you are right.

But in my case, the HTML node below is generated by our own server, so we know it will always look like this. And since the regex will live in an Android library, I don't want to pull in a dependency like Jsoup.

What I want to parse: <img src="myurl.jpg" width="12" height="32">

What should be parsed:

  • match a regular img tag, and group the src attribute value: <img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>
  • width and height attribute values: (width|height)\s*=\s*['"]([^'"]*)['"]*

So the first regex yields the img URL in group 1, and the second regex yields two matches whose groups hold each attribute's name and value.
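For reference, here is a rough sketch of the two-pass approach I have today. The class name `TwoPassImgParser` and the `parse` helper are made up for illustration; only the two patterns come from the list above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoPassImgParser {
    // First regex from the list above: group 1 is the src value.
    static final Pattern IMG =
        Pattern.compile("<img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>");
    // Second regex from the list above: group 1 is the attribute name, group 2 its value.
    static final Pattern SIZE =
        Pattern.compile("(width|height)\\s*=\\s*['\"]([^'\"]*)['\"]*");

    // Hypothetical helper: returns src, width and height of the first img tag found.
    static Map<String, String> parse(String html) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher img = IMG.matcher(html);
        if (!img.find()) {
            return result; // no img tag found
        }
        result.put("src", img.group(1));
        // Second pass, restricted to the text of the matched tag.
        Matcher size = SIZE.matcher(img.group());
        while (size.find()) {
            result.put(size.group(1), size.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("<img src=\"myurl.jpg\" width=\"12\" height=\"32\">"));
    }
}
```

It works, but it needs two patterns and two passes over the tag, which is what I would like to avoid.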

How can I merge both?

Desired output:

  • img url
  • width value
  • height value
Hugo Gresse

3 Answers

4

To match any img tag with src, height and width attributes that can come in any order and that are in fact optional, you can use

"(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3"

See the regex demo and an IDEONE Java demo:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "<img height=\"132\" src=\"NEW_myurl.jpg\" width=\"112\"><link src=\"/test/test.css\"/><img src=\"myurl.jpg\" width=\"12\" height=\"32\">";
// Same pattern as above: group 2 is the attribute name, group 4 its value
Pattern pattern = Pattern.compile("(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    if (!matcher.group(1).isEmpty()) { // Group 1 is non-empty only at the start of a new IMG tag
        System.out.println("\n--- NEW MATCH ---");
    }
    System.out.println(matcher.group(2) + ": " + matcher.group(4));
}

The regex details:

  • (<img\\b|(?!^)\\G) - the initial boundary: matches either the start of an <img tag or the end of the previous successful match
  • [^>]*? - matches any attributes we are not interested in (0+ characters other than >, as few as possible, so as to stay inside the tag)
  • \\b(src|width|height)= - a whole word src=, width= or height=
  • ([\"']?) - a technical Group 3 that captures the attribute value delimiter, if any
  • ([^>]*?) - Group 4, containing the attribute value (0+ characters other than >, as few as possible, up to the first occurrence of the delimiter captured in Group 3)
  • \\3 - the closing delimiter, which must be the same as the one captured in Group 3 (NOTE: if the delimiter may be empty, add (?=\\s|/?>) at the end of the pattern)
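To illustrate the note about an empty delimiter, here is a minimal sketch with the lookahead appended. The class name `UnquotedAttrDemo`, the `describe` helper and the input with unquoted values are assumptions made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnquotedAttrDemo {
    // Same pattern as above with (?=\\s|/?>) appended, so that an unquoted value
    // has to run up to the next whitespace or the end of the tag.
    static final Pattern ATTR = Pattern.compile(
        "(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3(?=\\s|/?>)");

    // Hypothetical helper: joins every name=value pair found, separated by ';'.
    static String describe(String html) {
        StringBuilder out = new StringBuilder();
        Matcher m = ATTR.matcher(html);
        while (m.find()) {
            out.append(m.group(2)).append('=').append(m.group(4)).append(';');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Mixed quoted and unquoted attribute values
        System.out.println(describe("<img src=myurl.jpg width=\"12\" height=32>"));
    }
}
```

Without the lookahead, the lazy Group 4 would stop immediately after an empty delimiter and return an empty value for the unquoted attributes.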

The logic:

  • Match the start of img tag
  • Then, match everything that is inside, but only capture the attributes we need
  • Since we are going to have multiple matches, not groups, we need to find a boundary for each new img tag. This is done by checking if the first group is not empty (if (!matcher.group(1).isEmpty()))
  • All that remains is to add a list to store the matches.
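The last step could be sketched roughly as follows: one map per img tag, collected into a list. The class name `ImgAttributeCollector` and the `collect` helper are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgAttributeCollector {
    static final Pattern ATTR = Pattern.compile(
        "(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3");

    // Returns one map per img tag, keyed by attribute name.
    static List<Map<String, String>> collect(String html) {
        List<Map<String, String>> tags = new ArrayList<>();
        Map<String, String> current = null;
        Matcher m = ATTR.matcher(html);
        while (m.find()) {
            // A non-empty group 1 means a new img tag has started. The first match
            // always takes this branch, because (?!^)\G cannot match at the start
            // of the input.
            if (!m.group(1).isEmpty()) {
                current = new LinkedHashMap<>();
                tags.add(current);
            }
            current.put(m.group(2), m.group(4));
        }
        return tags;
    }

    public static void main(String[] args) {
        String s = "<img height=\"132\" src=\"NEW_myurl.jpg\" width=\"112\">"
                 + "<link src=\"/test/test.css\"/>"
                 + "<img src=\"myurl.jpg\" width=\"12\" height=\"32\">";
        for (Map<String, String> tag : collect(s)) {
            System.out.println(tag);
        }
    }
}
```

A map per tag keeps lookups by attribute name simple and does not depend on the order in which the attributes appear.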
Wiktor Stribiżew
0

If you want to combine both, here is the answer:

<img\s+src="([^"]+)"\s+width="([^"]+)"\s+height="([^"]+)"

Sample input I tested it against:

<img src="rakesh.jpg" width="25" height="45">

Try this.
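In Java, the pattern might be used along these lines. This is only a sketch: the class name `FixedOrderImgParser` and the `parse` helper are made up, and it assumes the attributes always appear in the order src, width, height.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FixedOrderImgParser {
    // The regex above, escaped as a Java string literal.
    static final Pattern IMG = Pattern.compile(
        "<img\\s+src=\"([^\"]+)\"\\s+width=\"([^\"]+)\"\\s+height=\"([^\"]+)\"");

    // Returns {src, width, height}, or null when the tag does not match.
    static String[] parse(String html) {
        Matcher m = IMG.matcher(html);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] attrs = parse("<img src=\"rakesh.jpg\" width=\"25\" height=\"45\">");
        System.out.println(String.join(", ", attrs));
    }
}
```

If the attributes come in a different order, `parse` returns null, so this only suits markup whose shape is fully under your control.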

Maheshwar Ligade
0

You may want this:

"(?i)(src|width|height)=\"(.*?)\""


Update:

I misunderstood your question; you need something like:

"(?i)<img\\s+src=\"(.*?)\"\\s+width=\"(.*?)\"\\s+height=\"(.*?)\">"

Regex101 Demo


Update 2:

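The regex below matches the attributes one at a time through alternation, so it has to be applied repeatedly (e.g. with Matcher.find()). Note that, as written, src is only matched directly after <img and height only when it is immediately followed by >, so it does not handle every attribute order: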

"(?i)(?><img\\s+)src=\"(.*?)\"|width=\"(.*?)\"|height=\"(.*?)\">"  

Regex101 Demo v2
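A rough sketch of how the alternation could be consumed in Java. The class name `AlternationImgParser` and the `parse` helper are hypothetical. Each call to find() fills at most one of the three groups, so the loop copies whichever group is non-null.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationImgParser {
    static final Pattern ATTR = Pattern.compile(
        "(?i)(?><img\\s+)src=\"(.*?)\"|width=\"(.*?)\"|height=\"(.*?)\">");

    // Returns {src, width, height}; an entry stays null if its alternative never matched.
    static String[] parse(String html) {
        String[] result = new String[3];
        Matcher m = ATTR.matcher(html);
        while (m.find()) {
            for (int g = 1; g <= 3; g++) {
                if (m.group(g) != null) {
                    result[g - 1] = m.group(g);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] attrs = parse("<img src=\"myurl.jpg\" width=\"12\" height=\"32\">");
        System.out.println(attrs[0] + " " + attrs[1] + " " + attrs[2]);
    }
}
```

One caveat: the src alternative is anchored to `<img\s+`, so an input such as `<img width="12" src="myurl.jpg" height="32">` leaves the src entry null.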

Pedro Lobito