-6

I am trying to extract the text from an HTML page without using additional packages, as this is part of a CS course assignment. I want to write a method that omits any text between a '<' and a '>' and returns whatever remains. I already have a working method that returns the full page source; it lives in the parent class of the class I am currently working on. This is my attempt so far:

public String getUnfilteredPageContents() {
    String last = "";
    String rawHTML = this.getPageContents();
    for(int i=0; i<rawHTML.length(); i++) {
      last = last + rawHTML.charAt(i);
      if(rawHTML.charAt(i) != '<') {
        while(rawHTML.charAt(i) != '>') {
          i++;
        }
      }
    }  
    return last;
}

Any help will be appreciated. Thank you in advance.

Nikolas Charalambidis
  • You should probably post the code you have already written. – bradimus Feb 22 '16 at 14:30
  • If you have code that is "well-working", what are you asking? – Scott Hunter Feb 22 '16 at 14:32
  • This seems to be a frequent question in recent days: http://stackoverflow.com/questions/35532032/striphtmltags-exercise-in-java, http://stackoverflow.com/questions/35530304/regex-extract-a-href-attribute-from-html-with-special-name. What about using the search function first? [java extract html](http://stackoverflow.com/search?tab=newest&q=[java]%20extract%20html) – SubOptimal Feb 22 '16 at 14:34
  • @SubOptimal It says "without using additional packages" in the question. All other links ask for a solution via RegEx or Scanner (if I read it correctly which I assume I did). – Seth Feb 22 '16 at 14:38
  • @Seth is absolutely right. – icke_tlrtts Feb 22 '16 at 14:40
  • @SubOptimal what about reading the question first? – icke_tlrtts Feb 22 '16 at 14:41
  • @ScottHunter the well-working method is the method which extracts the whole page source, I am trying to add something to that to make a new method extract only the text. – icke_tlrtts Feb 22 '16 at 14:42
  • @orkalp When I posted my first comment there was **no** code snippet in your question. And the requirement is close to the first link I posted. There are already so many questions on SO in this context that I'm really sure there are pieces of code you could use as a basis. – SubOptimal Feb 22 '16 at 14:54

2 Answers

2

Here is a very naive solution.

  1. Load the web page into one very long String.
  2. Delete everything between '<' and '>', brackets included.
  3. A very simple regex spots a tag and removes it: string = string.replaceAll("<.*?>", "");
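
The steps above can be sketched as a small self-contained program. This is only an illustration under stated assumptions: the class name `RegexTagStripper` and the method `stripTags` are hypothetical, and the pattern `<[^>]*>` is swapped in for `<.*?>` because `.` does not match line breaks by default, so a tag that spans several lines would otherwise survive. `String.replaceAll` is part of `java.lang.String`, so no additional packages are involved.

```java
final class RegexTagStripper {

    // Hypothetical helper (not from the original answer):
    // removes every "<...>" run from the input.
    static String stripTags(String html) {
        // [^>]* matches any characters except '>', including line breaks.
        // Strings are immutable, so the result has to be returned or reassigned.
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b>!</p>";
        System.out.println(stripTags(html)); // prints "Hello world!"
    }
}
```

Like the naive approach it illustrates, this does not handle comments, scripts, or a '>' inside an attribute value.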
Nikolas Charalambidis
1

Something like this should work. Wrap it in a loop if you want it to handle more than one tag per input.

String s = "lalala <Hello from the other side> lalala"; // your input
s = s.substring(s.indexOf("<") + 1);
s = s.substring(0, s.indexOf(">"));

System.out.println(s); // prints the text inside the brackets

Always make sure you do not run past the end of the String (check it against length()) while looping.
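
A loop over the whole input might look like the sketch below. It is not the snippet above wrapped in a loop: it takes the opposite direction and drops the tags while keeping the text outside them, which is what the question asks for. The class name `LoopTagStripper`, the method `stripTags`, and the `insideTag` flag are hypothetical names chosen for illustration. The `for` condition keeps the index below `length()`, so `charAt` never runs past the end of the String.

```java
final class LoopTagStripper {

    // Hypothetical helper: copies only the characters that are outside "<...>".
    static String stripTags(String html) {
        StringBuilder text = new StringBuilder();
        boolean insideTag = false; // true between '<' and the next '>'

        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                insideTag = true;   // a tag starts; stop copying
            } else if (c == '>') {
                insideTag = false;  // the tag ends; resume copying after it
            } else if (!insideTag) {
                text.append(c);     // ordinary text outside any tag
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String s = "lalala <Hello from the other side> lalala";
        System.out.println(stripTags(s)); // prints "lalala  lalala" (two spaces remain)
    }
}
```

Using a flag instead of an inner `while` loop avoids advancing the index past the end when a '<' has no matching '>'. One simplification to be aware of: a stray '>' outside a tag is dropped as well.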

Seth