How would I obtain only the text of a webpage that contains a given keyword using Jsoup?

Question

I came up with the code below, which didn't work. I am trying to extract only the text that contains the keyword, not the entire text of the webpage just because the page contains that keyword somewhere.

    String pconcat = "";

    for (i = 0; i < urls.length; i++) {
        Document doc = Jsoup.connect(urls[i]).ignoreContentType(true).timeout(60 * 1000).get();

        for (int x = 0; x < keyWords.length; x++) {
            if (doc.body().text().toLowerCase().contains(keyWords[x].toLowerCase())) {
                Elements e = doc.select("body:contains(" + keyWords[x] + ")");
                for (Element element : e) {
                    pconcat += element.text();
                    System.out.println("pconcat" + pconcat);
                }
            }
        }
    }

Consider example.com: if the keyword I am looking for is "documents", I need the output to be "This domain is established to be used for illustrative examples in documents." and nothing else.

Please post an example of the input and the output/result you are trying to get. For now we don't really know how you want to limit this result. — Pshemo, Jul 04 '16 at 11:50
Since you already have the text of the page, simply iterate over all of its sentences and pick the ones that contain the word you are looking for. This should be helpful: http://stackoverflow.com/a/2687929/1393766 — Pshemo, Jul 04 '16 at 12:13
Actually, I am attempting to crawl a particular webpage and obtain the content that matches a specific keyword, more like keyword-related web crawling. Your approach worked well for this page, but I am not sure it will for all pages, because example.com has just 2 sentences. For a random webpage that has links, menus, and tabs, this approach might not be worthwhile. Any idea? — Lalitha, Jul 05 '16 at 11:57

score 0 · Answer 1 · answered Aug 09 '16 at 15:33

You don't need to lowercase the body text in order to use the :contains selector; it is case insensitive.

elements that contains the specified text. The search is case insensitive. The text may appear in the found element, or any of its descendants.

select() is only going to return elements if it finds a match.

elements that match the query (empty if none match)

You don't need an if-statement to check for "documents". Just use CSS selectors to select any element that matches, then do something with the results.

    Document doc = Jsoup
            .connect(url)
            .ignoreContentType(true)
            .timeout(60*1000)
            .get();

    for (String keyword : keywords) {

        String selector = String.format(
                "p:contains(%s)", 
                keyword.toLowerCase());

        String content = doc
                .select(selector)
                .text();

        System.out.println(content);

    }

Output

This domain is established to be used for illustrative examples in documents. You may use this domain in examples without prior coordination or asking for permission.
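
The output above still includes the second sentence because both sentences sit in the same p element. If only the sentence with the keyword is wanted, one option is to split the selected text into sentences, as suggested in the comments, and keep the matching ones. Below is a minimal sketch of that idea using java.text.BreakIterator from the standard library. The class name KeywordSentences and the method sentencesContaining are hypothetical names chosen for illustration, and the hard-coded paragraph stands in for the text returned by doc.select(selector).text().

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class KeywordSentences {

    // Hypothetical helper: returns the sentences of `text` that contain
    // `keyword` as a substring, ignoring case (similar in spirit to :contains).
    public static List<String> sentencesContaining(String text, String keyword) {
        List<String> matches = new ArrayList<>();
        String needle = keyword.toLowerCase(Locale.ROOT);

        // Sentence boundaries follow English rules here; that is an assumption
        // about the crawled pages.
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);

        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (sentence.toLowerCase(Locale.ROOT).contains(needle)) {
                matches.add(sentence);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Stand-in for: doc.select("p:contains(documents)").text()
        String paragraph = "This domain is established to be used for illustrative examples in documents. "
                + "You may use this domain in examples without prior coordination or asking for permission.";

        for (String sentence : sentencesContaining(paragraph, "documents")) {
            System.out.println(sentence);
        }
    }
}
```

Because the match is a plain substring test, a keyword such as "document" would also match "documents". Sentence detection is heuristic, so text with abbreviations or missing punctuation (menus, link lists) may be split differently than expected.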
