First things first: don't parse XML or [X]HTML with regex; it is generally considered a poor approach.
I understand that for quick-and-dirty applications you may not want to deal with third-party libraries. Still, you have to consider the following questions, each of which makes regex an even worse fit:
- Is your XML valid, or does it contain "broken" tags?
- Are the attributes always in the same order? Or does `caption` sometimes occur before `alt`?
- You already stated that some `image` tags only contain the `id` attribute.
These (and more) aspects determine the complexity of your regex solution.
You need a double loop in order to get all the required data.
- Find all the `image` tags: `(<image[^>]+)>` (this assumes there are no `>` characters in the attribute values)
- Then, inside the `image` tags you found, use this: `[ ]+([a-zA-Z0-9]+)="([^"]*)"`
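
To make the double loop concrete, here is a minimal sketch in Python. It is only an illustration under assumptions: the function name `extract_image_attributes` and the sample markup (including the `gallery` root element) are made up for this example, and the two patterns are the ones given above with all of their limitations.

```python
import re

# Outer pattern: one match per image tag (assumes no '>' inside attribute values).
IMAGE_TAG = re.compile(r'(<image[^>]+)>')
# Inner pattern: one match per name="value" pair inside a tag.
ATTRIBUTE = re.compile(r'[ ]+([a-zA-Z0-9]+)="([^"]*)"')


def extract_image_attributes(text):
    """Return one dict of attributes per image tag found in text (hypothetical helper)."""
    results = []
    for tag_match in IMAGE_TAG.finditer(text):          # outer loop: tags
        tag_body = tag_match.group(1)
        attrs = {}
        for attr_match in ATTRIBUTE.finditer(tag_body):  # inner loop: attributes
            attrs[attr_match.group(1)] = attr_match.group(2)
        results.append(attrs)
    return results


# Hypothetical sample input; the second tag only carries an id.
sample = '<gallery><image id="1" alt="Sunset" caption="Evening"/><image id="2"/></gallery>'
print(extract_image_attributes(sample))
```

Because each attribute is collected by name, the order of `alt` and `caption` no longer matters, but the sketch still silently skips single-quoted values and would also match a tag such as `<images>`.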
I hope you can already see that this is quite messy and does not cover all the cases of valid XML (single-quoted attribute values or attribute names containing hyphens or colons, for example)!
An XML parser is always the correct way to go.
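
For comparison, here is a short sketch of the parser route using Python's standard-library `xml.etree.ElementTree`, so no third-party library is needed. The function name `image_attributes` and the sample document are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET


def image_attributes(xml_text):
    """Return one dict of attributes per <image> element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    # iter("image") walks the whole tree and yields every element named image.
    return [dict(image.attrib) for image in root.iter("image")]


# Hypothetical sample: attributes in a different order, a '>' inside a value,
# and a single-quoted value, all of which the regex approach struggles with.
sample_xml = """<gallery><image caption="a > b" id="1" alt="x"/><image id='2'/></gallery>"""
print(image_attributes(sample_xml))
```

Note that this only works for well-formed input; `ET.fromstring` raises `xml.etree.ElementTree.ParseError` on "broken" tags, which at least tells you about the problem instead of silently returning partial data.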