I would like to find a piece of text inside the HTML of a web page as fast as possible. I suspect my current approach is far from optimal; do you have any tips?

My code is like this:

public static void main(String[] args) throws Exception
{
    URL url = new URL("http://stackoverflow.com/");
    BufferedReader in = new BufferedReader(
        new InputStreamReader(url.openStream()));

    String isPresent = "img";
    boolean on = false;

    String inputLine;
    while ((inputLine = in.readLine()) != null)
    {
        if (inputLine.contains(isPresent)) on = true;   // This takes a long time!
    }
}

Since web pages have many lines of HTML, and since I have little experience with HTML, the if (inputLine.contains(isPresent)) line sometimes takes several seconds to execute. Is there a more time-efficient way to do this? Thank you.

Matt
  • possible duplicate of [Parse Web Site HTML with JAVA](http://stackoverflow.com/questions/9071568/parse-web-site-html-with-java) – anotherdave Aug 07 '14 at 13:50
  • I can't imagine that this `inputLine.contains(isPresent)` piece of code takes several seconds. How did you find this out? I would say it is network latency. Try to separate reading the stream from searching for the string `isPresent`, and you will see which part takes *lots of seconds*. Also, break out of the loop as soon as you find the keyword. – A4L Aug 07 '14 at 14:03
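
One way to act on this suggestion is to time the download and the search separately. The sketch below is illustrative only: the class name `TimingSketch` and the helper `readAll` are hypothetical, UTF-8 is an assumed page encoding, and the program reads a small in-memory sample unless a URL is passed as the first argument.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class TimingSketch {

    // Hypothetical helper: drains the reader into a single String.
    static String readAll(Reader reader) {
        try {
            StringBuilder sb = new StringBuilder();
            char[] buffer = new char[8192];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        String keyword = "img";

        long t0 = System.nanoTime();
        String html;
        // With a URL argument, the read phase includes network latency;
        // without one, a small in-memory sample is used instead.
        try (Reader reader = args.length > 0
                ? new BufferedReader(new InputStreamReader(
                        new URL(args[0]).openStream(), StandardCharsets.UTF_8))
                : new StringReader("<html><body><img src=\"a.png\"></body></html>")) {
            html = readAll(reader);
        }
        long t1 = System.nanoTime();

        // The search phase works purely in memory.
        boolean found = html.contains(keyword);
        long t2 = System.nanoTime();

        System.out.printf("read:   %d ms%n", (t1 - t0) / 1_000_000);
        System.out.printf("search: %d ms%n", (t2 - t1) / 1_000_000);
        System.out.println("found:  " + found);
    }
}
```

Running it with a page URL as the argument prints the two durations side by side, which shows whether the time goes to the network read or to `contains`.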

3 Answers


You can exit the loop as soon as on is set to true.

To do this, change your while condition so that the flag is checked before the next line is read:

while (!on && (inputLine = in.readLine()) != null)
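
Put together, an early-exit version of the loop might look like the sketch below. The class and method names (`EarlyExitSearch`, `containsText`) are hypothetical, and the in-memory sample stands in for the reader built from `url.openStream()` so that the example runs without network access. The flag is tested first in the loop condition, so no further line is read once the text has been found.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class EarlyExitSearch {

    // Returns true once a line contains the keyword; later lines are not read.
    static boolean containsText(Reader source, String keyword) {
        try (BufferedReader in = new BufferedReader(source)) {
            boolean on = false;
            String inputLine;
            // !on is evaluated first, so readLine() is skipped after a match.
            while (!on && (inputLine = in.readLine()) != null) {
                on = inputLine.contains(keyword);
            }
            return on;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Sample input; in the question this would be the page's stream reader.
        String html = "<html>\n<body>\n<img src=\"logo.png\">\n<p>more lines</p>\n</body>\n</html>";
        System.out.println(containsText(new StringReader(html), "img"));   // true
        System.out.println(containsText(new StringReader(html), "video")); // false
    }
}
```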
Philipp Sander

If it's parsing that you mean, try Jsoup. This way you can check for any tag, count its occurrences, and so on. Lots of possibilities.

Document doc = Jsoup.connect("http://stackoverflow.com/").get();
boolean on = false;
if(doc.select("img").size() > 0){
    on = true;
} 
Syam S

You can use a Java library that parses XML and HTML documents, such as Jsoup or HtmlUnit. Try the code below after adding the Jsoup JAR to your classpath.

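Document doc = Jsoup.connect("http://stackoverflow.com/").get();
// text() returns the page's text with the markup stripped,
// so tag names such as "img" will not be matched here.
String docContent = doc.text();
boolean on = false;
if (docContent.contains("searchedText"))
    on = true;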
Mifmif
  • Yes, this works properly. Do you have any idea why this simple piece of code sometimes throws an exception like this? "Exception in thread "main" java.util.zip.ZipException: Corrupt GZIP trailer" – Matt Aug 07 '14 at 16:26
  • I'm using neither files nor GZIP libraries in my code. That's pretty hilarious. At other times it works fine. – Matt Aug 07 '14 at 16:28
  • The problem is caused by the Document class in Jsoup. I should investigate more carefully. – Matt Aug 07 '14 at 16:42