multiple matches in a single pattern?

Question

I have an input string that contains tags like:

<image id="1234" caption="text1" alt="text2">...blah blah...

There can be multiple instances of such strings in the input.

I want to retrieve the attributes (caption, alt, etc.) of each such tag along with the id, and then print the id, alt, caption, etc. Some images have no attributes other than the id.

Please advise.

score 3 · Answer 1 · edited May 23 '17 at 10:34

First things first: don't parse XML or [X]HTML with regex. This is generally considered a bad approach.

I understand that for quick-and-dirty applications you may not want to deal with third-party libraries. But you have to consider the following questions, each of which makes regex an even worse approach:

Is your XML valid, or does it contain "broken" tags?
Are the attributes always in the same order, or can the order of caption and alt vary?
You already stated that some image tags contain only the id attribute.

These aspects (and more) determine the complexity of your regex solution. You need two nested loops to get all the required data:

Find all the image tags: (<image[^>]+)> (this assumes there are no > characters in the attribute values)
Then, inside the image tags you found, use this: [ ]+([a-zA-Z0-9]+)="([^"]*)"
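
As a rough sketch of those two steps: the question does not name a language, so Python is an assumption here, and extract_images and the sample string are made-up names for illustration only. The two patterns are the ones given above, unchanged.

```python
import re

# Outer pattern: one <image ...> tag. Assumes no '>' inside attribute values.
IMAGE_TAG = re.compile(r'(<image[^>]+)>')
# Inner pattern: name="value" pairs, each preceded by at least one space.
ATTRIBUTE = re.compile(r'[ ]+([a-zA-Z0-9]+)="([^"]*)"')


def extract_images(text):
    """Return one dict of attribute name -> value per <image> tag in text."""
    images = []
    # Outer loop: every image tag in the input.
    for tag in IMAGE_TAG.finditer(text):
        # Inner loop (done by findall): every attribute inside that tag.
        images.append(dict(ATTRIBUTE.findall(tag.group(1))))
    return images


# Hypothetical input: one tag with all attributes, one with only an id.
sample = ('<image id="1234" caption="text1" alt="text2">...blah blah...'
          '<image id="5678">more text')

for attrs in extract_images(sample):
    # .get() returns None for attributes the tag does not have.
    print(attrs.get("id"), attrs.get("caption"), attrs.get("alt"))
```

Collecting the attributes into a dict means their order inside the tag does not matter. It still inherits every limitation of the patterns: single-quoted values, attribute names containing hyphens, and a > inside a value are all handled incorrectly.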

I hope you already see that this is quite messy and does not cover all the cases of valid XML!

An XML parser is always the correct way to go.
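
For comparison, here is a minimal sketch of the parser route, again assuming Python. Note that it swaps in the standard library's html.parser rather than a strict XML parser: the sample tag in the question is never closed, so a strict XML parser would most likely reject it. ImageCollector is a made-up name for illustration.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collects the attributes of every <image> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "image":
            self.images.append(dict(attrs))


collector = ImageCollector()
# Hypothetical input, same shape as the example in the question.
collector.feed('<image id="1234" caption="text1" alt="text2">...blah blah...'
               '<image id="5678">more text')
collector.close()

for attrs in collector.images:
    print(attrs.get("id"), attrs.get("caption"), attrs.get("alt"))
```

The parser takes care of quoting styles and attribute order, so no hand-written pattern has to anticipate them.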
