How to check if html document contains string

Question

What would be a fast way to check if a URL contains a given string? I tried jsoup and pattern matching, but is there a faster way?

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupTest {

    public static void main(String[] args) throws Exception {

        String url = "https://en.wikipedia.org/wiki/Hawaii";
        Document doc = Jsoup.connect(url).get();
        String html = doc.html();

        Pattern pattern = Pattern.compile("<h2>Contents</h2>");
        Matcher matcher = pattern.matcher(html);
        if (matcher.find()) {
            System.out.println("Found it");
        }
    }
}

Why do you compile a pattern? If the `html` is a `String` and your `pattern` is a `String`, you could simply use [`html.contains(pattern)`](https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#contains-java.lang.CharSequence-). — Turing85, Jul 22 '15 at 20:54
Your title is misleading. You want to check if the retrieved document contains the string, not the URL. — Chthonic Project, Jul 22 '15 at 20:55
Generally, if your code works and you are looking for review to find better way you should choose http://codereview.stackexchange.com/ over https://stackoverflow.com/ — Pshemo, Jul 22 '15 at 21:13
possible duplicate of [what is the fastest substring search method in Java](http://stackoverflow.com/questions/18340097/what-is-the-fastest-substring-search-method-in-java) — Alkis Kalogeris, Jul 22 '15 at 22:04

score 0 · Answer 1 · answered Jul 23 '15 at 16:18

It depends. If your pattern is really just a simple substring that must appear verbatim in the page content, then both methods you suggest are overkill. In that case you should fetch the page without parsing it. You can still use Jsoup to fetch the page; just don't start the parser:

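import org.jsoup.Connection;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;

Connection con = Jsoup.connect("https://en.wikipedia.org/wiki/Hawaii");
Response res = con.execute();   // fetches the page, throws IOException on failure
String rawPageStr = res.body(); // raw response body, no DOM is built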

if (rawPageStr.contains("<h2>Contents</h2>")){
  //do whatever you need to do
}
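
If the page is large and the string tends to appear early, another option is to stop reading as soon as a match turns up. The following is a minimal sketch using only the JDK, not part of Jsoup. `StreamSearch` and `containsInStream` are hypothetical names, the search string is assumed to contain no line break, and in real use the `Reader` would wrap the HTTP response stream instead of a `StringReader`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class StreamSearch {

    // Hypothetical helper: scans the input line by line and returns true
    // as soon as one line contains the needle.
    // Assumption: the needle itself contains no line break.
    public static boolean containsInStream(Reader source, String needle) {
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains(needle)) {
                    return true; // early exit: the rest of the input is not read
                }
            }
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String page = "<html>\n<body>\n<h2>Contents</h2>\n<p>More text</p>\n</body>\n</html>";
        System.out.println(containsInStream(new StringReader(page), "<h2>Contents</h2>"));
        System.out.println(containsInStream(new StringReader(page), "<h2>Missing</h2>"));
    }
}
```

Whether this beats `rawPageStr.contains(...)` depends on page size and where the match sits, so it is worth measuring before adopting it.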

If the pattern is indeed a regular expression, use this:

Pattern pattern = Pattern.compile("<h2>\\s*Contents\\s*</h2>");
Matcher matcher = pattern.matcher(rawPageStr);
boolean found = matcher.find(); // true if the heading occurs anywhere in the page
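
To make the difference between the two checks above concrete, here is a small JDK-only sketch run against in-memory strings. `HeadingMatch`, `containsExact` and `containsFlexible` are hypothetical names, and the sample markup is made up for illustration. The pattern is compiled once and reused, which is safe because `Pattern` instances are immutable.

```java
import java.util.regex.Pattern;

public class HeadingMatch {

    // Compiled once and reused for every page that is checked.
    private static final Pattern CONTENTS_H2 =
            Pattern.compile("<h2>\\s*Contents\\s*</h2>");

    // Exact substring check: whitespace inside the tag breaks the match.
    public static boolean containsExact(String page) {
        return page.contains("<h2>Contents</h2>");
    }

    // Regex check: tolerates whitespace (including line breaks) around the text.
    public static boolean containsFlexible(String page) {
        return CONTENTS_H2.matcher(page).find();
    }

    public static void main(String[] args) {
        String compact = "<div><h2>Contents</h2></div>";
        String spaced = "<div><h2>\n  Contents\n</h2></div>";

        System.out.println(containsExact(compact));    // true
        System.out.println(containsExact(spaced));     // false
        System.out.println(containsFlexible(compact)); // true
        System.out.println(containsFlexible(spaced));  // true
    }
}
```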

Searching the raw string only makes sense if you do not need to parse much more of the page. However, if you actually want to perform a structured search of the DOM via CSS selectors, Jsoup is not a bad choice, although a SAX-based approach like TagSoup could probably be a bit faster.

Document doc = Jsoup.connect("https://en.wikipedia.org/wiki/Hawaii").get();
Elements h2s = doc.select("h2");
for (Element h2 : h2s){
  if (h2.text().equals("Contents")){
    //do whatever & more
  }
}
