How to extract text from an HTML page with Java, using only loops and certain methods?

Question

I am trying to extract the text from an HTML page without using additional packages, because this is part of a CS course assignment. I want to write a method that omits any text between a '<' and a '>' and returns whatever remains. I already have a working method that returns the full page source; it lives in the parent class of the class I am currently working in.

public String getUnfilteredPageContents() {
    String last = "";
    String rawHTML = this.getPageContents();
    for(int i=0; i<rawHTML.length(); i++) {
      last = last + rawHTML.charAt(i);
      if(rawHTML.charAt(i) != '<') {
        while(rawHTML.charAt(i) != '>') {
          i++;
        }
      }
    }  
    return last;
}

Any help will be appreciated. Thank you in advance.

If you have code that is "well-working", what are you asking? — Scott Hunter, Feb 22 '16 at 14:32
This seems to be a frequent question these days: http://stackoverflow.com/questions/35532032/striphtmltags-exercise-in-java, http://stackoverflow.com/questions/35530304/regex-extract-a-href-attribute-from-html-with-special-name. What about using the search function first? [java extract html](http://stackoverflow.com/search?tab=newest&q=[java]%20extract%20html) — SubOptimal, Feb 22 '16 at 14:34
@SubOptimal It says "without using additional packages" in the question. All other links ask for a solution via RegEx or Scanner (if I read it correctly which I assume I did). — Seth, Feb 22 '16 at 14:38
@ScottHunter the well-working method is the method which extracts the whole page source, I am trying to add something to that to make a new method extract only the text. — icke_tlrtts, Feb 22 '16 at 14:42
@orkalp When I posted my first comment there was **no** code snippet from you in your question. And the requirement is close to the first link I posted. There are already so many questions on SO about this that I'm really sure there are pieces of code you could use as a basis. — SubOptimal, Feb 22 '16 at 14:54

Nikolas Charalambidis · Answer 1 · 2016-02-22T14:46:46.567

2

Here is a very naive solution.

Load the web page into one very long String.
Delete everything between < and >, brackets included.
Here is a very simple regex to match a tag: string = string.replaceAll("<.*?>", "");
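The regex idea above can be tried in isolation. This is only a minimal sketch: the class name RegexTagStripper and the helper stripTags are made up for illustration and are not from the answer. It assumes the pattern <[^>]*> instead of <.*?>, because in Java the dot does not match line breaks by default, so a tag split across two lines would otherwise be left in place. Note that replaceAll returns a new String, so the result has to be assigned or returned.

```java
class RegexTagStripper {
    // Hypothetical helper: removes every "<...>" run from the input.
    static String stripTags(String html) {
        // [^>]* matches any run of characters other than '>', line breaks included.
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p>";
        System.out.println(stripTags(html)); // prints "Hello world"
    }
}
```

Like the answer says, this is naive: a '>' inside an attribute value, a comment, or a script block will confuse it.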

edited Feb 22 '16 at 14:46

answered Feb 22 '16 at 14:44


1

That's pretty darn neat, not gonna lie. Didn't think about that. – Seth Feb 22 '16 at 14:46
Neat indeed but I always apply naive solutions that are possibly wrong :D – Nikolas Charalambidis Feb 22 '16 at 14:48

score 1 · Answer 2 · answered Feb 22 '16 at 14:46

Something like this should work. To handle more than one pair of brackets per input, you need to wrap it in a loop.

String s = "lalala <Hello from the other side> lalala"; // your input
s = s.substring(s.indexOf("<") + 1); // drop everything up to and including the first '<'
s = s.substring(0, s.indexOf(">"));  // keep everything before the first '>'

System.out.println(s); // prints the text inside the brackets

Always make sure you do not run past the end of the String (check the index against its length() method) while looping.
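For the case the question actually asks about (keeping the text outside the brackets), a loop with a flag may be enough. This is a minimal sketch, not a definitive solution: the class name LoopTagStripper and the helper stripTags are hypothetical, and it assumes the input is already available as a String. It uses only charAt, length and StringBuilder, all from java.lang, and the loop condition keeps the index below the String's length.

```java
class LoopTagStripper {
    // Hypothetical helper: returns the input with every "<...>" run removed.
    static String stripTags(String html) {
        StringBuilder text = new StringBuilder();
        boolean insideTag = false; // true while we are between '<' and '>'
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                insideTag = true;   // a tag starts; skip this character
            } else if (c == '>') {
                insideTag = false;  // the tag ends; skip this character too
            } else if (!insideTag) {
                text.append(c);     // ordinary text outside any tag
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String s = "lalala <Hello from the other side> lalala";
        System.out.println(stripTags(s)); // prints "lalala  lalala" (two spaces)
    }
}
```

Because the index only advances in the for header, there is no inner while loop that could run past the end of the String when a '<' has no matching '>'. One limitation of this sketch: a stray '>' that appears outside a tag is dropped as well.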
