I need to remove the data between two strings, as in the example below:

<PACKET>752</PACKET> 
  <TIME>23-Oct-2013 12:05:46 GMT Standard Time</TIME> 
  <INTERVAL>2</INTERVAL> 

<HEADER>hi this should not be printed only</HEADER>
<DATA></DATA>

Here I need to remove the data between <HEADER> and </HEADER>.
Can anybody give me a regex for this?

agarwal_achhnera
  • Using regex for such problems is not recommended; use an [HTML Parser](http://stackoverflow.com/questions/2168610/which-html-parser-is-best) instead. – Maroun Oct 23 '13 at 11:32

3 Answers

3

I think this can do the job with RegEx:

String str="b1<HEADER>aaaaa</HEADER>b2";
String newstring = str.replaceAll("<HEADER[^>]*>([^<]*)<\\/HEADER>", "");
System.out.println(newstring);

This prints b1b2

If there are other tags inside <HEADER>, the above will fail. Consider the example below:

String str = "b1<HEADER>aa<xxx>xx</xxx>aaa</HEADER>b2";
String newstring = str.replaceAll("<HEADER[^>]*>([^<]*)<\\/HEADER>", "");
System.out.println(newstring);

This prints: b1<HEADER>aa<xxx>xx</xxx>aaa</HEADER>b2

To overcome this and also remove any nested tags, match lazily up to the closing tag:

newstring = str.replaceAll("(?s)<HEADER[^>]*>.*?</HEADER>", "");
System.out.println(newstring);

This prints b1b2.
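
For reference, here is a minimal self-contained sketch that removes the whole element in both cases discussed above (plain text inside the tag, and nested tags inside it). The class name `HeaderStripper` and the method `stripHeader` are made up for illustration. It assumes that HEADER elements are never nested inside one another; `(?s)` lets `.` match line breaks, and the lazy `.*?` stops at the first closing tag.

```java
import java.util.regex.Pattern;

final class HeaderStripper {
    // (?s): '.' also matches line terminators.
    // .*? is lazy, so the match ends at the first </HEADER> after the opening tag.
    private static final Pattern HEADER_ELEMENT =
            Pattern.compile("(?s)<HEADER[^>]*>.*?</HEADER>");

    // Removes every HEADER element, tags included (hypothetical helper).
    static String stripHeader(String input) {
        return HEADER_ELEMENT.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripHeader("b1<HEADER>aaaaa</HEADER>b2"));
        System.out.println(stripHeader("b1<HEADER>aa<xxx>xx</xxx>aaa</HEADER>b2"));
    }
}
```

Compiling the pattern once is only a convenience here; for a one-off call, `String.replaceAll` with the same pattern string behaves the same way.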

Maroun
MaVRoSCy
1

Maroun's right that it's not a good idea, but if you have to do it then this might work:

(?ms)(.*<HEADER>).*(<\/HEADER>.*)

This captures everything up to and including <HEADER> in group 1, and everything from </HEADER> onwards in group 2. You can then concatenate the two to remove the bit in the middle.

See here: http://regex101.com/r/bC2eQ7
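
The concatenation step described above could look roughly like this in Java. This is only a sketch: the class name `HeaderGroups` and the method `emptyHeader` are hypothetical, and it assumes the input contains at most one HEADER element. Note that the result keeps the (now empty) `<HEADER></HEADER>` pair, since both tags sit inside the captured groups.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HeaderGroups {
    // Group 1: everything up to and including <HEADER>.
    // Group 2: everything from </HEADER> onwards.
    private static final Pattern AROUND_HEADER =
            Pattern.compile("(?ms)(.*<HEADER>).*(</HEADER>.*)");

    // Returns the input with the HEADER content dropped (hypothetical helper).
    static String emptyHeader(String input) {
        Matcher m = AROUND_HEADER.matcher(input);
        if (!m.matches()) {
            return input; // no HEADER element found: leave the input unchanged
        }
        return m.group(1) + m.group(2);
    }

    public static void main(String[] args) {
        String input = "<PACKET>752</PACKET>\n<HEADER>hi\nthere</HEADER>\n<DATA></DATA>";
        System.out.println(emptyHeader(input));
    }
}
```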

Stuart Golodetz
  • I used System.out.println(str.replaceAll("(?ms)(.*<HEADER>).*(<\\/HEADER>.*)", "")); but it does not work. – agarwal_achhnera Oct 23 '13 at 11:52
  • @dilip_jindal: That's because you're replacing the whole string with an empty string if you do that. As mentioned, you need to use the capture groups: either reference them in the replacement, as in `str.replaceAll(regex, "$1$2")`, or use a matcher: http://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html – Stuart Golodetz Oct 23 '13 at 13:23
0

This regex replaces everything inside the tag with an empty String:

String input = "<PACKET>752</PACKET>...<HEADER>hi this should be printed only</HEADER><DATA></DATA>";
String output = input.replaceAll("(?<=<HEADER>).*?(?=</HEADER>)", "");

Result:

<PACKET>752</PACKET>...<HEADER></HEADER><DATA></DATA>
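
The input above sits on a single line. By default `.` does not match line terminators in Java, so if the header content can span several lines, the same lookaround pattern needs the `(?s)` flag. A small sketch under that assumption (the class and method names are hypothetical):

```java
final class HeaderContentEraser {
    // (?<=<HEADER>) and (?=</HEADER>) are zero-width, so the tags themselves
    // are kept; (?s) lets '.' cross line breaks inside the header.
    static String eraseHeaderContent(String input) {
        return input.replaceAll("(?s)(?<=<HEADER>).*?(?=</HEADER>)", "");
    }

    public static void main(String[] args) {
        String input = "<HEADER>line one\nline two</HEADER><DATA></DATA>";
        System.out.println(eraseHeaderContent(input));
    }
}
```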