I was reading the Oracle documentation for regular expressions, but I can't find anything I can use to replace the for loop below. I have scraped the body of an HTML web page, but I am left with the HTML tags as well. Is there a regex that replaces everything beginning with "<" and ending with ">", essentially deleting the HTML tags altogether? The for loop does work; I was just hoping to find something cleaner.
// wordsOfWebsite is the String holding the HTML page; copy it into a char array.
char[] charWordsOfWebsite = wordsOfWebsite.toCharArray();
boolean insideHTMLTag = false;
// This loop blanks out every HTML tag.
for (int i = 0; i < charWordsOfWebsite.length; i++) {
    if (charWordsOfWebsite[i] == '<') { // Beginning of HTML tag
        charWordsOfWebsite[i] = ' ';
        insideHTMLTag = true;
    } else if (insideHTMLTag && charWordsOfWebsite[i] != '>') { // Inside HTML tag
        charWordsOfWebsite[i] = ' ';
    } else if (charWordsOfWebsite[i] == '>') { // End of HTML tag
        charWordsOfWebsite[i] = ' ';
        insideHTMLTag = false;
    }
}
// Convert the char array back to a String, collapse runs of whitespace into a single space,
// then use a negated character class to strip every character that is not a letter, digit, or whitespace.
// Finally, use the setter to store the refined words for later use.
setRefinedWordsOfWebsite(new String(charWordsOfWebsite).trim().replaceAll("\\s{2,}", " ").replaceAll("[^a-zA-Z0-9\\s]", ""));
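For comparison, here is a minimal sketch of what a regex-only version might look like, assuming the pattern `<[^>]*>` (a "<", any run of characters other than ">", then a ">") is close enough to what the loop does. The class name `TagStripper` and the method `stripTags` are hypothetical names used only for illustration; `wordsOfWebsite` is the variable from the code above. Unlike the loop, this pattern leaves an unclosed "<" with no matching ">" alone, and it will misfire on a ">" inside an attribute value, comment, or script, so a real HTML parser is generally the more robust choice.

```java
class TagStripper {
    // Hypothetical helper, not part of the original code.
    // Replaces each <...> span with a space, drops punctuation,
    // then collapses whitespace and trims the ends.
    static String stripTags(String html) {
        return html
                .replaceAll("<[^>]*>", " ")           // each closed tag becomes one space
                .replaceAll("[^a-zA-Z0-9\\s]", "")    // keep letters, digits, whitespace
                .replaceAll("\\s+", " ")              // collapse whitespace runs
                .trim();
    }

    public static void main(String[] args) {
        // Sample input made up for this sketch.
        String wordsOfWebsite = "<p>Hello, <b>world</b>!</p>";
        System.out.println(stripTags(wordsOfWebsite)); // prints "Hello world"
    }
}
```

In this sketch the whitespace is collapsed after the punctuation is removed, so a removed character sitting between two spaces does not leave a double space behind.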