Find a string inside HTML code efficiently in Java

Question

I would like to find a piece of text inside the HTML of a web page as fast as possible. I suspect my approach is the worst one possible; do you have any tips?

My code is like this:

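import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class FindInHtml   // class name chosen only to make the snippet compile
{
    public static void main(String[] args) throws Exception
    {
        URL url = new URL("http://stackoverflow.com/");
        // Read the page line by line over the network.
        BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream()));

        String isPresent = "img";   // text to look for
        boolean on = false;         // becomes true once the text is found

        String inputLine;
        while ((inputLine = in.readLine()) != null)
        {
            if (inputLine.contains(isPresent)) on = true;   //This takes a lot!!
        }
        in.close();
    }
}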

Since web pages contain many lines of HTML, and since I have little experience with HTML, the if(inputLine.contains(isPresent)) line sometimes takes several seconds to execute. Is there a more time-efficient way to do this? Thank you.

possible duplicate of [Parse Web Site HTML with JAVA](http://stackoverflow.com/questions/9071568/parse-web-site-html-with-java) — anotherdave, Aug 07 '14 at 13:50
I can't imagine that this `inputLine.contains(isPresent)` piece of code takes several seconds. How did you find this out? I would say it is the network latency. Try timing the reading of the stream separately from the search for the string `isPresent`, and you will see what takes *lots of seconds*. Also break out of the loop as soon as you find the keyword. — A4L, Aug 07 '14 at 14:03
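
The measurement suggested in the comment above could look roughly like the sketch below. It is only an illustration: the class name `TimingSketch` and the helper `readAll` are made up for this example, and a `StringReader` stands in for the network stream so the code runs offline. With a real page, the reader would be the `InputStreamReader` from the question, and the "read" figure would then include network latency.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class TimingSketch {

    // Hypothetical helper: drains the reader into one String (the "download" phase).
    public static String readAll(Reader source) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[8192];
        try (Reader in = source) {
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in input (assumption). For a real page, pass
        // new InputStreamReader(url.openStream()) instead.
        Reader source = new StringReader(
                "<html><body><img src=\"logo.png\"></body></html>");

        long t0 = System.nanoTime();
        String html = readAll(source);          // phase 1: reading
        long t1 = System.nanoTime();
        boolean on = html.contains("img");      // phase 2: searching
        long t2 = System.nanoTime();

        System.out.println("read:   " + (t1 - t0) / 1_000_000.0 + " ms");
        System.out.println("search: " + (t2 - t1) / 1_000_000.0 + " ms");
        System.out.println("found:  " + on);
    }
}
```

If the "read" figure dominates when run against a real URL, the time is going into the download rather than into `contains`.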

score 1 · Answer 1 · answered Aug 07 '14 at 13:48

You can exit the loop as soon as on is set to true.

To do this, change your while condition:

while ((inputLine = in.readLine()) != null && !on)
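
As a self-contained illustration of this early exit, here is a minimal sketch. The class name `EarlyExitSearch` and the method `containsLine` are invented for the example, and a `StringReader` replaces the network stream so it can run offline. One small variation on the condition above: `!on` is checked first, so no further line is read once a match has been found.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class EarlyExitSearch {

    // Hypothetical helper: true if any line contains the needle.
    // Reading stops at the first matching line.
    public static boolean containsLine(BufferedReader in, String needle) {
        boolean on = false;
        String inputLine;
        try {
            // "!on" comes first so that no extra line is read after a match.
            while (!on && (inputLine = in.readLine()) != null) {
                if (inputLine.contains(needle)) {
                    on = true;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return on;
    }

    public static void main(String[] args) {
        // Stand-in for the page content (assumption, not a real download).
        String html = "<html>\n<body>\n<img src=\"logo.png\">\n<p>more</p>\n</body>\n</html>";
        BufferedReader in = new BufferedReader(new StringReader(html));
        System.out.println(containsLine(in, "img"));
    }
}
```

Note that a line-by-line search like this cannot find text that spans a line break.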

answered Aug 07 '14 at 13:48 by Philipp Sander

score 0 · Answer 2 · answered Aug 07 '14 at 13:51

If it's parsing you mean, try Jsoup. That way you can check for any tag, count occurrences, and so on. Lots of possibilities.

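// Requires the jsoup library on the classpath.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Fetch and parse the page, then check whether any <img> element exists.
Document doc = Jsoup.connect("http://stackoverflow.com/").get();
boolean on = !doc.select("img").isEmpty();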

score 0 · Answer 3 · answered Aug 07 '14 at 14:11

You can use a Java library that parses XML and HTML documents, such as JSoup or HtmlUnit. Try the code below after adding the JSoup JAR to your classpath.

Document doc = Jsoup.connect("http://stackoverflow.com/").get();
// text() returns the text content only, without tags or attributes.
String docContent = doc.text();
boolean on = docContent.contains("searchedText");

answered Aug 07 '14 at 14:11 by Mifmif

Yeah, this works properly. Do you have any idea why this simple piece of code sometimes throws an exception like this? "Exception in thread "main" java.util.zip.ZipException: Corrupt GZIP trailer" – Matt Aug 07 '14 at 16:26
I'm using neither files nor GZ libraries in my code. That's pretty hilarious. Sometimes it works fine, though. – Matt Aug 07 '14 at 16:28
The problem is caused by the Document class in Jsoup. I should investigate more carefully. – Matt Aug 07 '14 at 16:42
