I am trying to capture the h1, h2, and h3 tags from the following HTML pages, but the h3 tags are returned only for the first URL, not for the second.

URL (returns H3) = https://docs.paloaltonetworks.com/prisma/prisma-access/prisma-access-panorama-release-notes/prisma-access-about/features-in-prisma-access

URL (doesn't return H3) = https://docs.paloaltonetworks.com/pan-os/10-2/pan-os-admin/authentication/configure-multi-factor-authentication/configure-mfa-between-rsa-securid-and-firewall

String url = "https://docs.paloaltonetworks.com/pan-os/10-2/pan-os-admin/authentication/configure-multi-factor-authentication/configure-mfa-between-rsa-securid-and-firewall";

try {
    Document html = Jsoup.connect(url).userAgent("Mozilla").get();

    Elements hTags = html.select("h1,h2,h3");
    System.out.println(hTags);
} catch (IOException e) {
    System.out.println("In exception " + e);
    throw new RuntimeException(e);
}

If I use View Page Source on either page, the H3 headers do not show up; however, both pages show the H3 headers when I inspect the page. Any help would be appreciated.

Kali Linux
  • Possibly related: [Page content is loaded with JavaScript and Jsoup doesn't see it](https://stackoverflow.com/q/7488872) – Pshemo Dec 05 '22 at 21:06

1 Answer

When I download the plain HTML (e.g. via "View page source"), I can't find any H3 headers either.

But when I use the developer tools (F12 in Firefox), I can find H3 headers.

This means the H3 headers are loaded dynamically after the initial page load. JSoup does not execute the scripts that load the additional content, so you won't get these values.
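To make that concrete, here is a rough sketch using only the JDK (no jsoup needed). The class name `StaticHeadingCheck`, the method `countHeadings`, and the sample markup are all made up for illustration; this is not the actual Palo Alto page. It counts heading tags in the markup as served, without running any script, which is roughly the view a static parser gets. The regex is a crude check, not a real HTML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StaticHeadingCheck {

    // Matches an opening <h1>, <h2>, or <h3> tag, with or without attributes.
    private static final Pattern HEADING_OPEN =
            Pattern.compile("<h([1-3])(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);

    /** Counts opening heading tags of the given level (1 to 3) in raw markup. */
    public static int countHeadings(String rawHtml, int level) {
        Matcher m = HEADING_OPEN.matcher(rawHtml);
        int count = 0;
        while (m.find()) {
            if (Integer.parseInt(m.group(1)) == level) {
                count++;
            }
        }
        return count;
    }

    // Hypothetical page: the h3 exists only after a browser runs the script.
    public static final String SAMPLE_HTML = String.join("\n",
            "<html><body>",
            "<h1>Title</h1>",
            "<h2>Section</h2>",
            "<div id=\"content\"></div>",
            "<script>",
            "  var h = document.createElement('h3');",
            "  h.textContent = 'Added by script';",
            "  document.getElementById('content').appendChild(h);",
            "</script>",
            "</body></html>");

    public static void main(String[] args) {
        // Nothing is executed here; we only look at the text of the markup.
        for (int level = 1; level <= 3; level++) {
            System.out.println("h" + level + ": " + countHeadings(SAMPLE_HTML, level));
        }
    }
}
```

For the sample this prints `h1: 1`, `h2: 1` and `h3: 0`, because the h3 only comes into existence once a browser runs the script. Running the same kind of check against the raw response of each of your URLs could tell you whether the H3 tags are in the served HTML at all.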

To conclude, and to cite from the linked question: JSoup is an HTML parser and is therefore unaware of any content that scripts add to the page after it has been loaded. That question also mentions this discussion: Is there a way to embed a browser in Java?

jmizv
  • Thanks @jmizv. My code successfully captures the H3 headers from another similar page, and that page doesn't show the H3 headers under "View page source" either. URL that returns H3: https://docs.paloaltonetworks.com/prisma/prisma-access/prisma-access-panorama-release-notes/prisma-access-about/features-in-prisma-access URL that doesn't return H3: https://docs.paloaltonetworks.com/pan-os/10-2/pan-os-admin/authentication/configure-multi-factor-authentication/configure-mfa-between-rsa-securid-and-firewall Not sure why they behave differently. – Kali Linux Dec 05 '22 at 21:55