I have a large HTML file. I want to remove a specific span tag while keeping its contents. The simple case looks like this:

<span class=GramE>bla bla bla</span>
Output: bla bla bla

OR

<span class=a><span class=GramE>bla bla bla</span></span>
Output: <span class=a>bla bla bla</span>

The tag can also appear in any other nested arrangement. In every case, the text between ... must be preserved.

Actual HTML:

<td width=265 colspan=3 valign=top style='width:7.0cm;background:white;
 padding:0cm 5.75pt 0cm 5.75pt'> <p class=MsoNormal style='margin-bottom:0cm;margin-bottom:.0001pt;text-align:justify;line-height:normal'><span class=GramE><span style='font-size:13.0pt'>(Here</span></span><span style='font-size:13.0pt'> Lorem ispsum. Lorem ispsum. Lorem ispsum. Lorem ispsum )</span></p>
            </td>

I have tried the following code, but replaceAll() doesn't seem to work. My HTML contains many nested span tags that need this treatment. Please help me figure out where I am going wrong.

String filename = "file-location.html";
try (BufferedReader br = new BufferedReader(new FileReader(filename))) {

    String line;
    String sb = "";

    while ((line = br.readLine()) != null) {

        String tmp = line.replaceAll("<span class=GramE[^>]*>/g", "");
        System.out.print(tmp);
    }

} catch (IOException e) {
    e.printStackTrace();
}
Nevermore

2 Answers

As explained in "RegEx match open tags except XHTML self-contained tags" (thanks to @Maurice Perry's comment), regular expressions are not a reliable way to parse HTML.

I recommend using jsoup instead, as shown in "Parse html with jsoup and remove the tag block".
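
Separately from the nesting problem, the pattern in the question can never match because of its "/g" suffix. Java has no such flag syntax, so "/g" is read as two literal characters that must follow the ">", and replaceAll() already replaces every match on its own. A small sketch to show the difference (the class and method names are made up for illustration):

```java
class SlashGDemo {

    // The pattern exactly as written in the question, including "/g".
    static String withSlashG(String line) {
        return line.replaceAll("<span class=GramE[^>]*>/g", "");
    }

    // The same pattern with the "/g" suffix dropped.
    static String withoutSlashG(String line) {
        return line.replaceAll("<span class=GramE[^>]*>", "");
    }

    public static void main(String[] args) {
        String line = "<span class=GramE>bla bla bla</span>";
        System.out.println(withSlashG(line));    // unchanged: nothing matches
        System.out.println(withoutSlashG(line)); // bla bla bla</span>
    }
}
```

Note that even the corrected pattern only strips the opening tag. The matching </span> is left behind, which is why a tool that understands nesting is the better choice here.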

D.Kastier

This answer was written before the actual HTML was added to the question. jsoup addresses grammatical (structural) issues, whereas a regex can only address lexical ones. So, for this problem, with its nested tags, jsoup is the right tool.

However, for the simple, non-nested case, this may help regex users:

line.replaceAll("<span class=GramE>([^<]*)</span>", "$1" );

([^<]*) is a capturing group, and $1 refers to the text it captured.

See the documentation.

Test case:

public class RemoveTagFromPage {

   public static void main( String[] args ) {
      final String text =
         "<html><body>" +
            "<p>hello</p>" +
            "<span class=a>" +
               "<span class=GramE>bla bla bla</span>" +
            "</span>" +
         "</body></html>";
      System.out.println(
         text.replaceAll("<span class=GramE>([^<]*)</span>", "$1" ));
   }
}

Execution log:

<html><body><p>hello</p><span class=a>bla bla bla</span></body></html>
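
The regex above only matches a GramE span whose content contains no other tag, so it leaves the actual HTML from the question untouched. To illustrate the lexical versus grammatical point without a third-party library, here is a rough sketch that uses a regex only to find span tags and a stack to pair each </span> with its opening tag. It assumes lowercase, well-nested span tags and is not a general HTML parser; GramEUnwrapper and unwrap are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GramEUnwrapper {

    // Any opening or closing span tag; group 1 is "/" for a closing tag.
    private static final Pattern SPAN_TAG = Pattern.compile("<(/?)span\\b[^>]*>");

    // A class attribute whose (first) value is GramE, quoted or not.
    private static final Pattern GRAM_E_CLASS =
        Pattern.compile("\\bclass\\s*=\\s*[\"']?GramE[\"']?[\\s>]");

    static String unwrap(String html) {
        StringBuilder out = new StringBuilder();
        // One entry per currently open span; true means it is a GramE span.
        Deque<Boolean> open = new ArrayDeque<>();
        Matcher m = SPAN_TAG.matcher(html);
        int last = 0;
        while (m.find()) {
            // Copy text and non-span tags verbatim.
            out.append(html, last, m.start());
            last = m.end();
            boolean closing = !m.group(1).isEmpty();
            boolean drop;
            if (closing) {
                // Drop the </span> only if it closes a GramE span;
                // an unmatched </span> is kept as-is.
                drop = !open.isEmpty() && open.pop();
            } else {
                drop = GRAM_E_CLASS.matcher(m.group()).find();
                open.push(drop);
            }
            if (!drop) {
                out.append(m.group());
            }
        }
        out.append(html, last, html.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // Nested case from the question.
        System.out.println(unwrap(
            "<span class=a><span class=GramE>bla bla bla</span></span>"));
        // prints: <span class=a>bla bla bla</span>

        // Shortened fragment of the actual HTML from the question.
        System.out.println(unwrap(
            "<span class=GramE><span style='font-size:13.0pt'>(Here</span></span>"));
        // prints: <span style='font-size:13.0pt'>(Here</span>
    }
}
```

Because a span may open on one line and close on another, this should be applied to the whole document as a single string rather than line by line as in the question's loop.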
Aubin