
I was reading the Oracle documentation for regular expressions, but I can't find anything I can use to replace the for loop below. I have scraped the body of an HTML web page, but I am left with the HTML tags as well. Is there a regex that replaces everything beginning with a "<" and ending with a ">", essentially deleting the HTML tags altogether? The for loop does work; I was just hoping to find something cleaner.


    char[] charWordsOfWebsite = wordsOfWebsite.toCharArray(); //wordsOfWebsite is the String I stored the html page into. Then store string as an array of characters.

    boolean insideHTMLTag = false;

    for (int i = 0; i <= charWordsOfWebsite.length-1 ; i++) {   //This loop gets rid of all html tags

        if (charWordsOfWebsite[i] == '<'){  //Beginning of html tag
            charWordsOfWebsite[i] = ' ';
            insideHTMLTag = true;
        } else if (insideHTMLTag && charWordsOfWebsite[i] != '>'){  //Inside html tag
            charWordsOfWebsite[i] = ' ';
        } else if (charWordsOfWebsite[i] == '>'){   //End of html tag
            charWordsOfWebsite[i] = ' ';
            insideHTMLTag = false;
        }
    }
    //Put char array into string, replace multiple white spaces with one white space, inverted regex replaces all characters except a-z, A-Z, 0-9, finally use setter to store the refined words string for later use.
    setRefinedWordsOfWebsite(new String(charWordsOfWebsite).trim().replaceAll("\\s{2,}", " ").replaceAll("[^a-zA-Z0-9\\s]", ""));
Arvind Kumar Avinash
Timothy
  • If you just want to remove the html tags and not the content within those tags, use the following regex `<[^>]+>`. You could use `.replaceAll()` method provided by `String` class to replace all occurrences of the html tags in a string. [Demo](https://regex101.com/r/6pdBt8/1) – Yousaf Jul 18 '20 at 14:13
  • [Don't Parse HTML With Regex](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Eritrean Jul 18 '20 at 14:53

1 Answer


You can use the regex `<[^>]+>` to match HTML tags. A `^` at the start of a `[ ]` character class negates it, so `[^>]` matches any single character except `>`, and `+` is a quantifier that means one or more times.

Demo:

    public class Main {
        public static void main(String[] args) {
            // Test string
            String str = "<html>\n" +
                    "<head>\n" +
                    "   <title>Hello World</title>\n" +
                    "</head>\n" +
                    "<body>\n" +
                    "   The whole world is facing economic challenge due to Coronavirus pandemic.\n" +
                    "</body>\n" +
                    "</html>";

            // Remove every substring that starts with '<' and ends with the next '>'
            str = str.replaceAll("<[^>]+>", "");
            System.out.println(str);
        }
    }

Output:

Hello World


The whole world is facing economic challenge due to Coronavirus pandemic.

Check this for another demo.

Update:

Use the regex `<!--.+-->|<[^>]+>` if you also want to match the pattern mentioned by VGR.
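
As a rough illustration of the difference, the sketch below runs the tag-only pattern and the comment-aware pattern on an input where a comment contains a `>`. The class name `CommentAwareStrip`, the sample strings, and the third pattern (a reluctant `.*?` with the `(?s)` flag) are assumptions added for this example and are not part of the answer. The third pattern is included because `.+` is greedy: when two comments sit on the same line, it matches from the first `<!--` to the last `-->` and also removes any text in between.

```java
class CommentAwareStrip {
    // Pattern from the answer: tags only.
    static final String TAGS_ONLY = "<[^>]+>";
    // Pattern from the update: try a comment first, then an ordinary tag.
    static final String COMMENTS_AND_TAGS = "<!--.+-->|<[^>]+>";
    // Assumed variant (not from the answer): (?s) lets '.' match line breaks,
    // and the reluctant .*? stops at the first "-->".
    static final String COMMENTS_AND_TAGS_LAZY = "(?s)<!--.*?-->|<[^>]+>";

    // Hypothetical helper: remove every match of regex from html.
    static String strip(String html, String regex) {
        return html.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        String oneComment = "<p>one</p><!-- a > b --><p>two</p>";
        System.out.println(strip(oneComment, TAGS_ONLY));          // one b -->two
        System.out.println(strip(oneComment, COMMENTS_AND_TAGS));  // onetwo

        String twoComments = "a<!-- x -->keep<!-- y -->b";
        System.out.println(strip(twoComments, COMMENTS_AND_TAGS));      // ab
        System.out.println(strip(twoComments, COMMENTS_AND_TAGS_LAZY)); // akeepb
    }
}
```

With the tag-only pattern the match ends at the `>` inside the comment, so the tail of the comment leaks into the output. Neither variant is a full HTML parser; for example, a `>` inside a quoted attribute value would still end a tag match early.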

Arvind Kumar Avinash
  • This is valid HTML: `` – VGR Jul 18 '20 at 15:12
  • Thank you this worked great. I also used that pattern to exclude anything in between [ ] brackets too. Final code looks like this --> `setRefinedWordsOfWebsite(wordsOfWebsite.trim().replaceAll("\\[[^\\]]+\\]", " ").replaceAll("<[^>]+>"," ").replaceAll("([^a-zA-Z0-9\\s])", "").replaceAll("\\s{2,}", " "));` – Timothy Jul 18 '20 at 15:31
  • @VGR - Your point has been addressed in the update section. – Arvind Kumar Avinash Jul 18 '20 at 16:54
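
For reference, the chain from Timothy's comment can be wrapped in a small method and tried on a sample string. This is only a sketch: the class name `RefineWords`, the method name `refine`, and the sample inputs are hypothetical, and `trim()` has been moved to the end of the chain (an assumption, not the original order) so that the space left by a leading or trailing tag is removed as well.

```java
class RefineWords {
    // Hypothetical helper based on the replaceAll chain in the comment above.
    static String refine(String wordsOfWebsite) {
        return wordsOfWebsite
                .replaceAll("\\[[^\\]]+\\]", " ")   // drop anything between [ and ]
                .replaceAll("<[^>]+>", " ")         // drop HTML tags, leaving a space
                .replaceAll("[^a-zA-Z0-9\\s]", "")  // keep only letters, digits, whitespace
                .replaceAll("\\s{2,}", " ")         // collapse runs of whitespace
                .trim();                            // assumption: trim last, not first
    }

    public static void main(String[] args) {
        System.out.println(refine("<p>Hello, <b>World</b>!</p> [1] <p>Bye</p>")); // Hello World Bye

        // Replacing tags with "" joins neighbouring words; replacing with " " keeps them apart.
        String items = "<li>one</li><li>two</li>";
        System.out.println(items.replaceAll("<[^>]+>", "")); // onetwo
        System.out.println(refine(items));                   // one two
    }
}
```

Replacing each tag with a space rather than an empty string matters when words are separated only by markup, as the second sample shows.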