Scanning and displaying every word from a website's source code in Java

Question

I have been given a task to scan the contents of a website's source code, and use delimiters to extract all hyperlinks from the site and display them. After some looking around online this is what I have so far:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.Scanner;

    public class HyperlinkMain {
        public static void main(String[] args) {
            // Read the address to fetch from standard input.
            try (Scanner in = new Scanner(System.in)) {
                String address = in.next();
                URL website = new URL(address);

                // The nested try-with-resources closes the reader even if reading fails.
                try (BufferedReader input = new BufferedReader(
                        new InputStreamReader(website.openStream()))) {
                    String inputLine;
                    while ((inputLine = input.readLine()) != null) {
                        // Process each line.
                        System.out.println(inputLine);
                    }
                }
            } catch (MalformedURLException me) {
                System.out.println(me);
            } catch (IOException ioe) {
                System.out.println(ioe);
            }
        }
    }

So my program can read each line of a website's source code and display it, but what I really want is to extract each word from the source code rather than each line. I don't know how to do that, because I keep getting errors when I use input.read();

I'm seeing two different requirements: "Extract all hyperlinks" or "Extract all words". Which of those two are you attempting to accomplish? — Gus, Feb 19 '14 at 17:09
I have to extract all hyperlinks, however to do that I think that I should have to extract all words, and then search for the ones containing " etc — user3275341, Feb 19 '14 at 17:16
I don't think you'll need to extract all words first. Just slurp the whole file into a single string and look for everything matching your favorite hyperlink regex. — Gus, Feb 19 '14 at 17:36
Also I'd be remiss if I didn't warn you [html is not a regular language](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). — Gus, Feb 19 '14 at 17:47
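As a rough illustration of the "slurp the whole file into a single string" suggestion from the comments, the sketch below reads everything from a `Reader` into one `String` and, only if individual words are really wanted, splits that string on whitespace. The class name `PageSlurper` and its methods `readAll` and `words` are hypothetical names chosen for this example, and a `StringReader` stands in for the network stream so the snippet runs without a connection.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper class, not part of the original question.
final class PageSlurper {
    /** Reads everything from the reader into one string, joining lines with '\n'. */
    static String readAll(Reader reader) {
        // lines() reports read failures as UncheckedIOException.
        return new BufferedReader(reader).lines().collect(Collectors.joining("\n"));
    }

    /** Splits text on runs of whitespace; "word" here just means a whitespace-delimited token. */
    static List<String> words(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return List.of();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        // In the question's program this reader would be
        // new InputStreamReader(website.openStream()).
        Reader source = new StringReader("<p>Hello\n<a href=\"http://example.com\">link</a></p>");
        String page = readAll(source);
        System.out.println(page);
        System.out.println(words(page));
    }
}
```

Note that whitespace-delimited tokens cut through HTML tags (`<a` and `href=...` end up in different tokens), which is one reason searching the whole string is usually easier than working word by word.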

score 1 · Answer 1 · answered Feb 19 '14 at 17:05

There is plenty of sample code around for retrieving web pages. Look at the java.util.regex.Pattern class to see how to match hyperlinks in text with a regular expression. You can treat your homework assignment as two separate problems by working on the hyperlink extraction separately from the web page download.
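A minimal sketch of the hyperlink-extraction half, assuming the page has already been read into a `String`. It uses `Pattern` and `Matcher.find()` to collect every quoted `href` value inside an `<a ...>` tag. The class name `LinkExtractor`, the method `extractLinks`, and the pattern itself are illustrative assumptions rather than a complete solution: unquoted attribute values, links inside HTML comments, and other edge cases are not handled, since HTML cannot be fully parsed with a regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration.
final class LinkExtractor {
    // Matches <a ... href="..."> or <a ... href='...'>, case-insensitively.
    // Group 1 is the opening quote, group 2 is the link target.
    private static final Pattern ANCHOR_HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    /** Returns the href value of every anchor tag found, in document order. */
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = ANCHOR_HREF.matcher(html);
        // find() advances to the next match each time it is called.
        while (matcher.find()) {
            links.add(matcher.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/a\">A</a> <A HREF='/b'>B</A>";
        // Prints http://example.com/a and /b, one per line.
        extractLinks(html).forEach(System.out::println);
    }
}
```

Keeping extraction in a method that takes a plain `String` lets it be tested on small hand-written snippets before it is connected to the download code.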

Michael Shopsin


Splitting the task into two problems is a really good suggestion. – Gus Feb 19 '14 at 17:07
@Gus NP, breaking down problems is a lot of what programming is about, other than debugging. A small hint: a [regex](http://regextester.com/) can find multiple instances of a word or pattern, which makes life much easier. – Michael Shopsin Feb 19 '14 at 17:22
