
I would like to build a crawler in Java that gives me all cookies from a website. The crawler should crawl a list of websites (and, of course, their subpages) automatically.

I am using jsoup and Selenium for this.

package com.mycompany.app;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.openqa.selenium.Cookie;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.*;

public class BasicWebCrawler {
    private static Set<String> uniqueURL = new HashSet<String>();
    private static List<String> link_list = new ArrayList<String>();

    private static Set<String> uniqueCookies = new HashSet<String>();

    private static void get_links(String url) {
        Connection connection = null;
        Connection.Response response = null;
        String this_link = null;

        try {
            connection = Jsoup.connect(url);
            response = connection.execute();

            //cookies_http = response.cookies();

            // fetch the document over HTTP
            Document doc = response.parse();

            // get all links in page
            Elements links = doc.select("a[href]");

            if(links.isEmpty()) {
                return;
            }

            for (Element link : links) {
                this_link = link.attr("href");

                boolean add = uniqueURL.add(this_link);

                System.out.println("\n" + this_link + "\n" + "title: " + doc.title());

                if (add && (this_link.contains(url))) {
                    System.out.println("\n" + this_link + "\n" + "title: " + doc.title());

                    link_list.add(this_link);

                    get_links(this_link);
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        get_links("https://de.wikipedia.org/wiki/Wikipedia");

        /**
         * This is where Selenium comes into play
         */
        WebDriver driver;

        System.setProperty("webdriver.chrome.driver", "D:\\crawler\\driver\\chromedriver.exe");

        driver = new ChromeDriver();

        // create a file named Cookies.data to store the cookie information
        File file = new File("Cookies.data");

        FileWriter fileWrite = null;
        BufferedWriter Bwrite = null;

        try {
            // Delete old file if exists
            file.delete();
            file.createNewFile();

            fileWrite = new FileWriter(file);
            Bwrite = new BufferedWriter(fileWrite);
        } catch (Exception ex) {
            ex.printStackTrace();
        }


        for(String link : link_list) {
            System.out.println("Open Link: " + link);

            driver.get(link);

            try {
                // loop for getting the cookie information
                for (Cookie ck : driver.manage().getCookies()) {
                    String tmp = (ck.getName() + ";" + ck.getValue() + ";" + ck.getDomain() + ";" + ck.getPath() + ";" + ck.getExpiry() + ";" + ck.isSecure());

                    if(uniqueCookies.add(tmp)) {
                        Bwrite.write("Link: " + link + "\n" + tmp + "\n\n");
                        Bwrite.newLine();
                    }
                }
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        }

        try {
            Bwrite.close();
            fileWrite.close();

            driver.close();
            driver.quit();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}

I tested this code on a Wikipedia page and compared the result with a cookie scanner called CookieMetrix.

My code shows only four cookies:

Link: https://de.wikipedia.org/wiki/Wikipedia:Lizenzbestimmungen_Commons_Attribution-ShareAlike_3.0_Unported
GeoIP;DE:NW:M__nster:51.95:7.54:v4;.wikipedia.org;/;null;true


Link: https://de.wikipedia.org/wiki/Wikipedia:Lizenzbestimmungen_Commons_Attribution-ShareAlike_3.0_Unported
WMF-Last-Access-Global;13-May-2019;.wikipedia.org;/;Mon Jan 19 02:28:33 CET 1970;true


Link: https://de.wikipedia.org/wiki/Wikipedia:Lizenzbestimmungen_Commons_Attribution-ShareAlike_3.0_Unported
WMF-Last-Access;13-May-2019;de.wikipedia.org;/;Mon Jan 19 02:28:33 CET 1970;true


Link: https://de.wikipedia.org/wiki/Wikipedia:Lizenzbestimmungen_Commons_Attribution-ShareAlike_3.0_Unported
mwPhp7Seed;55e;de.wikipedia.org;/;Mon Jan 19 03:09:08 CET 1970;false

But the cookie scanner shows seven. I don't know why my code finds fewer cookies than CookieMetrix. Can you help me?

Basti G.
  • Your code is quite “messy” -- you should boil that down to the essentials. Most important: Are you using Selenium **or** jsoup? – qqilihq May 13 '19 at 19:57
  • I think that is the essential code, because with less code nobody would understand what it does. As you can see in the code, I use jsoup AND Selenium: jsoup to crawl the links to all subpages, and Selenium to open each of those pages and get the (JavaScript) cookies. – Basti G. May 13 '19 at 21:41
  • Which 3 cookies does your crawler not get? – luksch May 14 '19 at 06:10
  • I did not analyze your code nor the wikipedia site, but could it be, that your crawling does not find all links, because they may be generated by JavaScript? Did you try a selenium only approach as well? If my idea is correct, with a full Selenium crawler you probably get all cookies. – luksch May 14 '19 at 06:14
  • Have you tried puppeteer or a service like [cookieserve](https://www.cookieserve.com)? – mujuonly Sep 24 '19 at 09:14

1 Answer


JavaDoc for java.util.Set<Cookie> getCookies():

Get all the cookies for the current domain. This is the equivalent of calling "document.cookie" and parsing the result

  1. document.cookie will not return HttpOnly cookies, simply because JavaScript does not allow it.

  2. Also notice that the “CookieMetrix” seems to list cookies from different domains.

Solutions:

  • To get a listing such as CookieMetrix's (1+2), you could put your browser behind a proxy and sniff the traffic; see the first sketch after this list.

  • In case you want to get all cookies for the current domain, including HttpOnly ones (1), you could try accessing Chrome's DevTools API directly (as far as I recall, it will also return HttpOnly cookies); see the second sketch after this list.
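
This is not part of the original answer, just a rough sketch of the proxy idea. It uses BrowserMob Proxy (my own choice for the example; any intercepting proxy would do) to capture the Set-Cookie response headers of everything the Selenium-driven Chrome loads, regardless of domain and regardless of the HttpOnly flag. The Wikipedia URL is only an example, and because HTTPS traffic is intercepted, Chrome has to accept the proxy's self-signed certificate, hence the ACCEPT_INSECURE_CERTS capability.

import net.lightbody.bmp.BrowserMobProxy;
import net.lightbody.bmp.BrowserMobProxyServer;
import net.lightbody.bmp.client.ClientUtil;
import net.lightbody.bmp.core.har.HarEntry;
import net.lightbody.bmp.proxy.CaptureType;
import org.openqa.selenium.Proxy;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.remote.CapabilityType;

public class ProxyCookieSniffer {
    public static void main(String[] args) {
        // embedded intercepting proxy on a random free port
        BrowserMobProxy proxy = new BrowserMobProxyServer();
        proxy.start(0);
        proxy.enableHarCaptureTypes(CaptureType.RESPONSE_HEADERS);

        // route the Selenium-driven Chrome through the proxy
        Proxy seleniumProxy = ClientUtil.createSeleniumProxy(proxy);
        ChromeOptions options = new ChromeOptions();
        options.setCapability(CapabilityType.PROXY, seleniumProxy);
        // needed so Chrome accepts the proxy's self-signed MITM certificate
        options.setCapability(CapabilityType.ACCEPT_INSECURE_CERTS, true);
        WebDriver driver = new ChromeDriver(options);

        try {
            proxy.newHar("cookies");
            driver.get("https://de.wikipedia.org/wiki/Wikipedia");

            // print every Set-Cookie header that went over the wire,
            // including HttpOnly and third-party cookies
            for (HarEntry entry : proxy.getHar().getLog().getEntries()) {
                entry.getResponse().getHeaders().stream()
                        .filter(h -> h.getName().equalsIgnoreCase("Set-Cookie"))
                        .forEach(h -> System.out.println(
                                entry.getRequest().getUrl() + " -> " + h.getValue()));
            }
        } finally {
            driver.quit();
            proxy.stop();
        }
    }
}

The HAR log contains one entry per request, so the same cookie can show up several times; you would still deduplicate, similar to the uniqueCookies set in the question.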
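
And a sketch of the DevTools route, under the assumption that you can move to Selenium 4, whose ChromeDriver exposes executeCdpCommand (this thread predates that release). The CDP command Network.getAllCookies returns the browser's cookies including HttpOnly ones, unlike driver.manage().getCookies().

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.openqa.selenium.chrome.ChromeDriver;

public class DevToolsCookieDump {
    public static void main(String[] args) {
        ChromeDriver driver = new ChromeDriver();
        try {
            driver.get("https://de.wikipedia.org/wiki/Wikipedia");

            // raw Chrome DevTools Protocol call; unlike driver.manage().getCookies()
            // this also returns HttpOnly cookies
            Map<String, Object> result =
                    driver.executeCdpCommand("Network.getAllCookies", new HashMap<>());

            @SuppressWarnings("unchecked")
            List<Map<String, Object>> cookies =
                    (List<Map<String, Object>>) result.get("cookies");

            for (Map<String, Object> cookie : cookies) {
                System.out.println(cookie.get("name") + ";" + cookie.get("value") + ";"
                        + cookie.get("domain") + ";" + cookie.get("path") + ";"
                        + "httpOnly=" + cookie.get("httpOnly"));
            }
        } finally {
            driver.quit();
        }
    }
}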

qqilihq
  • So I should crawl the website with Selenium to also get the links that are generated by JavaScript, and then open a connection with jsoup to get the HTTP response. Can I search the response for "Set-Cookie" to get the HttpOnly cookies? – Basti G. May 14 '19 at 15:07
  • I wouldn’t add Jsoup to the party. As I said, my solution would be to put the Selenium browser behind a proxy and sniff the headers. – qqilihq May 14 '19 at 17:25
  • I tried to build a crawler with Selenium only, but I get a StaleElementReferenceException. Can you help me in [this new thread](https://stackoverflow.com/questions/56150033/selenium-fire-staleelementreferenceexception)? – Basti G. May 15 '19 at 13:03