
I am trying to apply a Java regex to the following text to extract the link content. When the text contains only one href, it finds the content fine, but when there are more, the match runs on to the end of the text. Here is the regex pattern:

Pattern pattern = Pattern.compile("\\\"\\>(.*)\\</a\\>\\<br\\>", Pattern.DOTALL);

Here is the text:

<div><b>Attachments:</b> <a href="http://projectspace.intranet.group/sites/CFY366N/Lists/Deliverables/Attachments/8/1.JPG">http://projectspace.intranet.group/sites/CFY366N/Lists/Deliverables/Attachments/8/1.JPG</a><br><a href="http://projectspace.intranet.group/sites/CFY366N/Lists/Deliverables/Attachments/8/yinYang.gif">http://projectspace.intranet.group/sites/CFY366N/Lists/Deliverables/Attachments/8/yinYang.gif</a><br><a href=""></a></div>

So if there is only the href for 1.JPG, it finds the right answer:

http://projectspace.intranet.group/sites/CFY366N/Lists/Deliverables/Attachments/8/1.JPG

But when I add yinYang.gif, it finds the following:

">http://projectspace.intranet.group/sites/CFY366N/Lists/Deliverables/Attachments/8/1.JPG</a><br><a href="http://projectspace.intranet.group/sites/CFY366N/Lists/Deliverables/Attachments/8/yinYang.gif">http://projectspace.intranet.group/sites/CFY366N/Lists/Deliverables/Attachments/8/yinYang.gif</a><br>

How can I change this to find all the values between <a> ... </a> as separate groups?

Ali
user261002

1 Answer

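Change your pattern into a non-greedy one. The greedy .* (especially with DOTALL) consumes as much text as it can, so the match runs to the last </a><br>; the reluctant .*? stops at the first one: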

"\\\"\\>(.*?)\\</a\\>\\<br\\>"

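As a minimal sketch of how the reluctant pattern could be used to collect every link text, the snippet below calls Matcher.find() in a loop and reads group 1 on each hit. The class name LinkTextExtractor, the method extract, and the example.com URLs are hypothetical names for illustration only; the pattern is the one above with the redundant backslashes dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question.
final class LinkTextExtractor {

    // Reluctant (.*?) so each match ends at the nearest </a><br>.
    private static final Pattern LINK_TEXT =
            Pattern.compile("\">(.*?)</a><br>", Pattern.DOTALL);

    // Returns the captured text of every match, in document order.
    static List<String> extract(String html) {
        List<String> texts = new ArrayList<>();
        Matcher matcher = LINK_TEXT.matcher(html);
        while (matcher.find()) {          // one iteration per link
            texts.add(matcher.group(1));  // group 1 = text between "> and </a><br>
        }
        return texts;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the text in the question (assumed URLs).
        String html = "<div><b>Attachments:</b> "
                + "<a href=\"http://example.com/1.JPG\">http://example.com/1.JPG</a><br>"
                + "<a href=\"http://example.com/yinYang.gif\">http://example.com/yinYang.gif</a><br>"
                + "<a href=\"\"></a></div>";
        for (String text : extract(html)) {
            System.out.println(text);
        }
    }
}
```

Each call to find() resumes after the previous match, so every link ends up as its own list entry rather than as extra groups within one match.
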
However, six words of warning are appropriate: don't do it this way.

You are essentially trying to parse (semi-)structured information using regular expressions. Experience shows that you are doomed if you follow this route: either regexen will prove not powerful enough to solve your problem in the end (think of nested structures), or you will produce unmaintainable code. Probably both.
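For comparison, here is a rough sketch of the parser-based route, assuming the HTML parser that ships with the JDK (javax.swing.text.html.parser.ParserDelegator) is acceptable for your input; a dedicated HTML library may well be a better fit. The class name AnchorTextParser and the method anchorTexts are hypothetical. It collects the text found between each <a> start and end tag and skips anchors with no text.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of the original answer.
final class AnchorTextParser {

    // Returns the non-empty text content of every <a> element, in document order.
    static List<String> anchorTexts(String html) {
        final List<String> texts = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private StringBuilder current; // non-null only while inside an <a> element

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (current != null) {
                    current.append(data);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.A && current != null) {
                    if (current.length() > 0) {
                        texts.add(current.toString());
                    }
                    current = null;
                }
            }
        };
        try {
            // true = ignore any charset declared inside the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return texts;
    }
}
```

The callback no longer depends on what follows the closing tag, so a link that is not followed by <br> is handled the same way as the others.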

collapsar