4

Let's say I have XML in the form of a string. I wish to remove the content between two tags within the XML string, say <tagName>. I have tried:

String newString = oldString.replaceFirst("\\<tagName>.*?\\<//tagName>",
                                                              "Content Removed");

but it does not work. Any pointers as to what I am doing wrong?

TookTheRook

3 Answers

10

OK, apart from the obvious answer (don't parse XML with regex), maybe we can fix this:

String newString = oldString.replaceFirst("(?s)<tagName[^>]*>.*?</tagName>",
                                          "Content Removed");

Explanation:

(?s)             # turn DOTALL (single-line) mode on (otherwise '.' won't match '\n')
<tagName         # remove unnecessary (and perhaps erroneous) escapes
[^>]*            # allow optional attributes
>.*?</tagName>   # match reluctantly, up to the first closing tag

Are you sure you're matching the tag case correctly? Perhaps you also want to add the i flag to the pattern: (?si)
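As a minimal sketch of the pattern above in use: the class name TagContentRemover and the method removeFirst are hypothetical, and Pattern.quote, the \b after the tag name and Matcher.quoteReplacement are additions of mine so that an arbitrary tag name and placeholder are treated literally. It still shares the usual limits of regex on XML (a self-closing <tagName/> or nested elements of the same name will confuse it).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagContentRemover {
    // Hypothetical helper: replaces the first <tagName ...>...</tagName>
    // element (tags included) with the given placeholder.
    static String removeFirst(String xml, String tagName, String placeholder) {
        String name = Pattern.quote(tagName);           // treat the tag name literally
        String regex = "<" + name + "\\b[^>]*>"         // opening tag, optional attributes
                     + ".*?"                            // reluctant: stop at first close
                     + "</" + name + "\\s*>";           // closing tag
        // DOTALL lets '.' match newlines; CASE_INSENSITIVE mirrors the (?si) flags.
        Pattern p = Pattern.compile(regex, Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
        // quoteReplacement keeps '$' and '\' in the placeholder literal.
        return p.matcher(xml).replaceFirst(Matcher.quoteReplacement(placeholder));
    }

    public static void main(String[] args) {
        String oldString = "<root><tagName id=\"1\">line1\nline2</tagName></root>";
        System.out.println(removeFirst(oldString, "tagName", "Content Removed"));
    }
}
```

Without DOTALL the sample input would not match at all, because the content spans two lines.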

Sean Patrick Floyd
  • In the end, simply using string.replaceFirst(".*", "Content Removed"); worked fine, I don't know why I was making it so complicated. Thanks for explaining the regex attributes in Java though, pretty helpful! – TookTheRook Jun 27 '11 at 15:43
0

Probably the problem lies here:

<//tagName>

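Try changing it to

</tagName>

A forward slash has no special meaning in a Java regex, so it needs no escaping. Note that "<\/tagName>" would not even compile, because \/ is not a valid escape sequence in a Java string literal.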

Pablo Fernandez
0

XML is a grammar; regular expressions are not the best tools to work with grammars.

My advice would be to use a real parser and work with the DOM instead of doing pattern matches.

For example, if you have:

<xml>
 <items>
  <myItem>
     <tagToRemove>something1</tagToRemove>
  </myItem>
  <myItem>
     <tagToRemove>something2</tagToRemove>
  </myItem>
 </items>
</xml>

A regex could match too much (quantifiers are greedy by default), running from the first opening tag to the last closing tag and leaving:

<xml>
 <items>
  <myItem>
     matchString
  </myItem>
 </items>
</xml>
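The over-matching can be reproduced with a short sketch. The class GreedyVsReluctant and its two methods are hypothetical names, and the sample XML is the one above with the whitespace removed so the results are easy to compare.

```java
class GreedyVsReluctant {
    static final String XML =
          "<items><myItem><tagToRemove>something1</tagToRemove></myItem>"
        + "<myItem><tagToRemove>something2</tagToRemove></myItem></items>";

    // Greedy '.*' runs to the LAST closing tag, swallowing the second <myItem>.
    static String greedy(String xml) {
        return xml.replaceFirst("(?s)<tagToRemove>.*</tagToRemove>", "matchString");
    }

    // Reluctant '.*?' stops at the FIRST closing tag.
    static String reluctant(String xml) {
        return xml.replaceFirst("(?s)<tagToRemove>.*?</tagToRemove>", "matchString");
    }

    public static void main(String[] args) {
        System.out.println(greedy(XML));
        // prints <items><myItem>matchString</myItem></items>
        System.out.println(reluctant(XML));
        // prints <items><myItem>matchString</myItem><myItem><tagToRemove>something2</tagToRemove></myItem></items>
    }
}
```

The reluctant quantifier avoids this particular failure, but it does not help with nested elements of the same name.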

Also, some forms that a DTD may allow (such as <tagToRemove/> or <tagToRemove attr="value">) make matching tags with a regex more difficult.

Unless it is very clear to you that none of the above may occur (now or in the future), I would go with a parser.
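A rough sketch of the parser route, using only the JDK's built-in DOM and transformer APIs. The class DomTagRemover and the method replaceTagContent are hypothetical names. Unlike the regex in the question, this version keeps the tags and replaces only their content, in every matching element; it also assumes the input is well-formed and trusted.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomTagRemover {
    // Hypothetical helper: replaces the content of every <tagName> element
    // with the placeholder and returns the document serialized as a string.
    static String replaceTagContent(String xml, String tagName, String placeholder) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            // Matches <tagToRemove>, <tagToRemove/> and <tagToRemove attr="value"> alike.
            NodeList nodes = doc.getElementsByTagName(tagName);
            for (int i = 0; i < nodes.getLength(); i++) {
                // Drops all child nodes and inserts a single text node.
                nodes.item(i).setTextContent(placeholder);
            }

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Wrapped for brevity; real code would handle parse errors explicitly.
            throw new IllegalStateException("Could not process XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<xml><items><myItem><tagToRemove>something1</tagToRemove>"
                   + "</myItem></items></xml>";
        System.out.println(replaceTagContent(xml, "tagToRemove", "Content Removed"));
    }
}
```

For untrusted input, the factory would also need hardening against external entities, which is left out here.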

SJuan76