
I've been trying to figure out why jsoup's .select("div.zn-body__paragraph") doesn't work on certain CNN articles. For articles like this one it doesn't work, even though the page clearly contains elements with that class, whereas an article like this one works. Here's the complete code I've written:


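    public static String getContentCNN(String link) throws IOException {
        // StringBuilder avoids re-allocating a new String on every iteration
        StringBuilder content = new StringBuilder();

        Elements paragraphs = getDocsCNN(link).select("div.zn-body__paragraph");

        // Append each paragraph's text, separated by a blank line
        for (Element p : paragraphs) {
            content.append(p.text()).append("\n\n");
        }

        return content.toString();
    }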

Both articles contain div elements with this class, like so:


    <div class="zn-body__paragraph">Nadler on Wednesday said he didn't know the White House's motives, but he would not allow the White House to try to claim that the President cannot be held accountable.</div>

    <div class="zn-body__paragraph">"I don't know whether they're trying to taunt us toward an impeachment or anything else," Nadler said. "All I know is they have made a preposterous claim."</div>

So far, I've tried div#class, div[class], and getElementsByClass("class").

Thanks.

EDIT: Here is the source code for getDocsCNN():


    public static Document getDocsCNN(String link) throws IOException {
        return Jsoup.connect(link).userAgent("Mozilla").timeout(6000).get();
    }

Dragonsnap
  • How do you get the html from the site? Do you add the userAgent string? – TDG May 16 '19 at 18:07
  • @TDG Yep, I did. I did Mozilla, as I got the source from Firefox. I edited my original post to show the code for getDocsCNN. – Dragonsnap May 16 '19 at 19:25
  • The first link fetches the text with some JavaScript code, so jsoup cannot handle it. You can see this by turning off JavaScript in your browser. Read about PhantomJS or some other headless browser; it might help you. – TDG May 16 '19 at 19:55
  • TDG is correct. Jsoup only parses the HTML as it was returned by the server. It does not run JS (like the browser would). In this case, CNN is using JS to manipulate the DOM. – Zack May 17 '19 at 13:39
  • @Dragonsnap on a related note, not all of the CNN content is div.zn-body__paragraph, some of it is p.zn-body__paragraph. – Zack May 17 '19 at 13:40
  • see here for possible solution strategies: https://stackoverflow.com/questions/22390741/jsoup-get-dynamically-generated-html/22400976#22400976 – luksch May 20 '19 at 06:04
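
The two points raised in the comments can be checked without hitting CNN at all. The following is only a minimal sketch, assuming jsoup is on the classpath: the class name ParagraphSelectorSketch, its helper methods, and the sample markup are all made up for illustration and are not taken from the real articles. It shows that elements a script would add at runtime are not present in what jsoup parses, and that a class-only selector (.zn-body__paragraph) matches both the div and the p variants.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ParagraphSelectorSketch {

    // Hypothetical static markup: one div and one p share the same class.
    static final String STATIC_HTML = "<html><body>"
            + "<div class=\"zn-body__paragraph\">First paragraph.</div>"
            + "<p class=\"zn-body__paragraph\">Second paragraph.</p>"
            + "</body></html>";

    // Hypothetical page whose only paragraph would be created by JavaScript.
    static final String SCRIPTED_HTML = "<html><body><script>"
            + "var d = document.createElement('div');"
            + "d.className = 'zn-body__paragraph';"
            + "d.textContent = 'Injected.';"
            + "document.body.appendChild(d);"
            + "</script></body></html>";

    // Counts matches for the original, div-only selector.
    static int countDivParagraphs(String html) {
        return Jsoup.parse(html).select("div.zn-body__paragraph").size();
    }

    // Joins the text of every element with the class, whatever its tag name.
    static String extractParagraphs(String html) {
        Document doc = Jsoup.parse(html);
        StringBuilder content = new StringBuilder();
        for (Element p : doc.select(".zn-body__paragraph")) {
            content.append(p.text()).append("\n\n");
        }
        return content.toString();
    }

    public static void main(String[] args) {
        // The div-only selector misses the p element in the static markup.
        System.out.println(countDivParagraphs(STATIC_HTML));   // prints 1
        // jsoup does not execute the script, so nothing is found here.
        System.out.println(countDivParagraphs(SCRIPTED_HTML)); // prints 0
        // The class-only selector picks up both the div and the p.
        System.out.print(extractParagraphs(STATIC_HTML));
    }
}
```

If the count is 0 for the document returned by getDocsCNN(link), the paragraphs are most likely being added by JavaScript, and a headless browser (as suggested above) would be needed to obtain the rendered HTML before handing it to jsoup.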
