Java web scanning to a text file

Question

I am new to web scraping with Java (I believe this is the correct term) and have been trying to find a good tutorial on what I am attempting:

I would like to have a class in the program I am creating that scans a given website for all its data and stores it. Then I can use this data in my Main class.

I am asking that someone point me in the correct direction with the best tutorial for what I am asking OR that someone explain how I would program this.

http://stackoverflow.com/questions/2835505/how-to-scan-a-website-or-page-for-info-and-bring-it-into-my-program — user3520080, Feb 12 '16 at 16:23
Please research your question before adding a new topic. This question in nearly the exact same terms has been asked before as posted by "user3520080". — jesric1029, Feb 12 '16 at 16:28
@jesric1029 I had seen this question before I asked, but had trouble understanding where to go from there. But thanks for the feedback. — phoenix, Feb 12 '16 at 16:43
Please edit your question adding what you just said and I would be happy to change my down-vote. I suggest always putting a disclaimer in your question that says "I did find the search results but don't understand them" so that people know why you have posted a duplicate and are less likely to down-vote. What do you not understand exactly? — jesric1029, Feb 12 '16 at 19:02

jesric1029 · Accepted Answer · 2016-02-12T19:37:29.087

Okay, I'll try to answer this better than the other question does. First let me say that if you aren't familiar with DOM parsing or any other type of document parsing, you will probably find this quite difficult.

The first thing you're going to need to do is turn the HTML into a document. Using JSoup you can do this with:

 Document doc = Jsoup.connect("http://example.com")
  .data("query", "Java")
  .userAgent("Mozilla")
  .cookie("auth", "token")
  .timeout(3000)
  .post();

Now you have a document called "doc". (For a page you only want to read, you would normally end the chain with .get() instead of .post() and leave out the .data(...) and .cookie(...) calls.) The document mirrors the structure of the HTML. To "parse" it you will have to do some serious navigation; unfortunately there is no magical "parse entire document" call. (The same goes for XML: trust me, I just had to parse an XML file with over 100 nodes and it was time-consuming.)

So to navigate it, it is very helpful to understand the structure of the HTML. Consider printing "doc" (for example with System.out.println(doc)) so you can actually see what the HTML looks like before you go any further.

Once you know your tag names you can use a wide variety of methods like

getElementById(String id)

That returns an Element; calling text() on it gives you a String that you can save.
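As a rough illustration of that lookup, here is a minimal sketch. It assumes the jsoup library is on the classpath; the HTML snippet, the id "price" and the class and method names are made up for the example. It parses a String with Jsoup.parse so it can be tried without a network connection.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Sketch only: requires the jsoup library on the classpath.
// The HTML snippet and the id "price" are invented for illustration.
class ElementByIdSketch {

    // Returns the visible text of the element with the given id,
    // or null when the document has no such element.
    static String textOfId(String html, String id) {
        Document doc = Jsoup.parse(html);          // parse a String instead of fetching a URL
        Element element = doc.getElementById(id);  // null if the id is not present
        return element == null ? null : element.text();
    }

    public static void main(String[] args) {
        String html = "<html><body><p id=\"price\">42 EUR</p></body></html>";
        System.out.println(Jsoup.parse(html));        // dump the parsed HTML to inspect its structure
        System.out.println(textOfId(html, "price"));  // prints "42 EUR"
    }
}
```

The null check matters in practice: getElementById returns null when the id is missing, so calling text() directly would throw a NullPointerException.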

You're going to need loops and ArrayLists in situations where there are multiple tags of the same name.
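A minimal sketch of that loop-and-collect pattern, again assuming jsoup is on the classpath. The list markup and the class and method names are hypothetical; getElementsByTag and text() are jsoup calls.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Sketch only: requires the jsoup library on the classpath.
// The markup below is invented for illustration.
class RepeatedTagsSketch {

    // Collects the text of every element with the given tag name, in document order.
    static List<String> textsOfTag(String html, String tagName) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        for (Element element : doc.getElementsByTag(tagName)) {
            texts.add(element.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<ul><li>alpha</li><li>beta</li><li>gamma</li></ul>";
        System.out.println(textsOfTag(html, "li")); // prints "[alpha, beta, gamma]"
    }
}
```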

I'm not going to go much further into the methods because you're really just going to have to practice. With a DOM parser and XML, the call I used to read a node's value was getTextContent(); in jsoup the closest equivalent is text() on an Element.

Here is an example of how I used the DOM parser to parse an XML file (note that I used XPath to navigate my document, which may be different from how you do it):

// RfrdDocInfNbS and RfrdDocInfNbAL are not declared in this snippet; they are
// assumed to be a String and an ArrayList<String> declared elsewhere.
XPathExpression RfrdDocInfNbexpr = xpath.compile("//Ntfctn/Ntry/NtryDtls/TxDtls/RmtInf/Strd/RfrdDocInf/Nb");
Object RfrdDocInfNb = RfrdDocInfNbexpr.evaluate(doc, XPathConstants.NODESET);
NodeList nodesRfrdDocInfNb = (NodeList) RfrdDocInfNb;
for (int i = 0; i < nodesRfrdDocInfNb.getLength(); i++) {
    Element RfrdDocInfNbel = (Element) nodesRfrdDocInfNb.item(i);
    // Utilities.xmlToString is my own helper that serializes an element to a String.
    RfrdDocInfNbS = Utilities.xmlToString(RfrdDocInfNbel);
    // Drop the first 42 characters, then the last 5, leaving only the value.
    int length = RfrdDocInfNbS.length();
    RfrdDocInfNbS = RfrdDocInfNbS.substring(42, length);
    length = RfrdDocInfNbS.length();
    RfrdDocInfNbS = RfrdDocInfNbS.substring(0, length - 5);
    RfrdDocInfNbAL.add(RfrdDocInfNbS);
}

So what did I do there?

XPathExpression RfrdDocInfNbexpr = xpath.compile("//Ntfctn/Ntry/NtryDtls/TxDtls/RmtInf/Strd/RfrdDocInf/Nb");

Sets the path of the element (also called a node) that I want to extract the value from.

Object RfrdDocInfNb = RfrdDocInfNbexpr.evaluate(doc, XPathConstants.NODESET);

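Evaluates that expression against the document. Because I asked for XPathConstants.NODESET, the result is a set of nodes, returned as a plain Object.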

NodeList nodesRfrdDocInfNb = (NodeList) RfrdDocInfNb;

Casts that result to a NodeList so I can loop over every match. (There may be multiple tags with the same name; in my XML there were 60 of each tag.)

Element RfrdDocInfNbel = (Element) nodesRfrdDocInfNb.item(i);

Turns my node into an element. Since you're using HTML, you may be able to just BEGIN at this part: getting an element is your objective.

RfrdDocInfNbS = Utilities.xmlToString(RfrdDocInfNbel);

This is important! This is how to turn an element into a String, and I had a lot of trouble with this part. Since you're using HTML this exact call won't work, but the point is that you will have to figure out how to turn an HTML element into a String.

So that is how I used a parser to go through my XML and extract everything into ArrayLists and Strings. I had many blocks of code like that.
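For anyone who wants to try the XPath pattern in isolation, here is a minimal sketch that uses only the JDK. The XML snippet, the shortened path and the class and method names are invented for the example. It reads each value with getTextContent() instead of serializing the element and trimming fixed offsets; the offsets in my code were presumably there to strip the XML declaration and the surrounding tags.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Sketch only: the XML and the path are made up, not the real file from the answer.
class XPathTextSketch {

    // Returns the text content of every node matched by the XPath expression.
    static List<String> valuesAt(String xml, String path) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(path, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent()); // text between the tags, no markup
            }
            return values;
        } catch (Exception e) {
            // Wrap checked parser and XPath exceptions to keep the sketch short.
            throw new IllegalStateException("Could not evaluate " + path, e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Ntfctn><Ntry><Nb>INV-1</Nb></Ntry><Ntry><Nb>INV-2</Nb></Ntry></Ntfctn>";
        System.out.println(valuesAt(xml, "//Ntfctn/Ntry/Nb")); // prints "[INV-1, INV-2]"
    }
}
```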

If you REALLY want to undertake this project I suggest doing research on the JSoup website here: http://jsoup.org/cookbook/extracting-data/dom-navigation.

And again, this is an advanced project, so don't expect to understand it in a day. I would expect it to take at least a week of reading and practice unless you are already familiar with parsing.
