-1

I have large documents which have some strings in them which look like this:

<font face='Greek1'>D</font>  

These are not full-fledged HTML documents (I know regex and HTML are a big no-no), and they are well behaved on this point. The values between > and < are arbitrary.

The documents are large and I need to do a replace across them so that the line:

<font face='Greek1'>D</font>

looks instead like:

D

I've written this regex:

(<font face='[A-z0-9]*'>)

This pattern takes care of the first section (the opening tag) for any face attribute. The closing tag

</font> 

is also pretty easy to code up.

If I have code that looks like this:

Pattern pattern = Pattern.compile(MYREGEX);
Matcher matcher = pattern.matcher(MYSTRING);
String clean = matcher.replaceAll("");

Is there a way to write a single pattern which will find and replace on both the first section:

   <font face='Greek1'>

and the second section:

   </font> 

While leaving whatever arbitrary characters are between the > and < in place? Or do I have to do these as two separate regexes?

Nathaniel D. Waggoner
  • possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Mena Jul 02 '14 at 17:37
  • I would argue it's not a duplication, based on the types of exclusions. I want to find and replace on strings, while leaving parts of the replaced content untouched. The linked question is about a different kind of problem. – Nathaniel D. Waggoner Jul 02 '14 at 17:40

4 Answers

1

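You can just use <font face='[A-Za-z0-9]*'>|</font> as the regex, and it will replace both simultaneously. Note that the range A-z in your original character class also matches the punctuation characters that sit between Z and a in ASCII ([ \ ] ^ _ `), so A-Za-z is the safer way to write it.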
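As a minimal sketch of the alternation approach, the snippet below plugs the combined pattern into the Pattern/Matcher code from the question. The class name FontTagStripper and the method strip are hypothetical names chosen for illustration, and the character class is written as [A-Za-z0-9] on the assumption that only letters and digits appear in the face attribute.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question or answer.
final class FontTagStripper {
    // Either an opening <font face='...'> tag or a closing </font> tag.
    // Assumption: the face value contains only ASCII letters and digits.
    private static final Pattern FONT_TAG =
            Pattern.compile("<font face='[A-Za-z0-9]*'>|</font>");

    // Removes every opening and closing font tag, leaving the text between them.
    static String strip(String input) {
        return FONT_TAG.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String doc = "x <font face='Greek1'>D</font> y <font face='Symbol'>a b</font>";
        System.out.println(strip(doc)); // x D y a b
    }
}
```

Because the tags are replaced with the empty string, nothing between them has to be captured, so arbitrary content survives unchanged.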

Explosion Pills
1

For your specific example, this would work:

String s = "<font face='Greek1'>D</font>";
String value = s.replaceAll("(<.*?>)(.*?)(</.*?>)", "$2"); // D

In substance:

  • (<.*?>) matches the opening <...> tag - the ? makes the quantifier reluctant, so it stops at the first > instead of running on to the last > in the string
  • the second group is your value
  • the third group is the closing tag
  • $2 refers to the second group, i.e. the value
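To see how the group reference behaves on more than one occurrence, here is a small sketch that applies the same three-group pattern to a longer string. The class name GroupReplaceDemo and the method unwrap are hypothetical. Two caveats worth keeping in mind: the pattern matches any tag pair, not only font, and since . does not match line terminators by default, a value that spans several lines would be left untouched unless the DOTALL flag is used.

```java
// Hypothetical demo class for the capture-group replacement.
final class GroupReplaceDemo {
    // Group 1 = opening tag, group 2 = value, group 3 = closing tag.
    // Only group 2 is written back, so both tags disappear.
    static String unwrap(String input) {
        return input.replaceAll("(<.*?>)(.*?)(</.*?>)", "$2");
    }

    public static void main(String[] args) {
        String doc = "A <font face='Greek1'>D</font> and <font face='Greek2'>a + b</font>.";
        System.out.println(unwrap(doc)); // A D and a + b.
    }
}
```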
assylias
0

You can try reluctant quantifiers, which remove every tag and leave the text between them:

System.out.println("<font face='Greek1'>D</font>".replaceAll("<.*?>", "")); // D
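The difference between the reluctant and the greedy form is easiest to see side by side. The following is only an illustrative sketch; the class and method names are hypothetical.

```java
// Hypothetical demo class contrasting reluctant and greedy quantifiers.
final class ReluctantVsGreedy {
    // Reluctant: each match stops at the first '>' after a '<'.
    static String stripReluctant(String s) {
        return s.replaceAll("<.*?>", "");
    }

    // Greedy: a single match runs from the first '<' to the last '>' on the line.
    static String stripGreedy(String s) {
        return s.replaceAll("<.*>", "");
    }

    public static void main(String[] args) {
        String s = "<font face='Greek1'>D</font>";
        System.out.println("[" + stripReluctant(s) + "]"); // [D]
        System.out.println("[" + stripGreedy(s) + "]");    // []
    }
}
```

With the greedy version the value D is swallowed together with the tags, which is why the ? is needed here.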
Braj
0

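You can use a non-greedy regex:

String value = s.replaceAll(".*?>(\\w+)<.*?>", "$1"); 

This replaces everything of the form ...>D<...> and keeps only D, so

<font face='Greek1'>D</font>

becomes just D. The final > matters: a reluctant .*? at the very end of a pattern matches the empty string, so without it the result would be D/font>.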

If you want to remove only the font tags, you can use:

String value = s.replaceAll("<font.*?>(\\w+)</font>", "$1"); 
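One thing to watch with \\w+ is that it only matches letters, digits and underscores, while the question says the values between the tags are arbitrary. A possible variation, offered here as an assumption rather than as part of the original answer, is to capture [^<]* instead and to match the attributes with [^>]*. The class and method names below are hypothetical.

```java
// Hypothetical demo class comparing the \w+ capture with a broader one.
final class FontUnwrapDemo {
    // Pattern from the answer: the value must consist of word characters only.
    static String unwrapWordOnly(String s) {
        return s.replaceAll("<font.*?>(\\w+)</font>", "$1");
    }

    // Assumed variation: the value may be any run of characters other than '<'.
    static String unwrapAnyText(String s) {
        return s.replaceAll("<font[^>]*>([^<]*)</font>", "$1");
    }

    public static void main(String[] args) {
        String s = "<font face='Greek1'>a + b</font>";
        System.out.println(unwrapWordOnly(s)); // unchanged: spaces and '+' are not word characters
        System.out.println(unwrapAnyText(s));  // a + b
    }
}
```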
Federico Piazza