12

How can I make WebClient download external css stylesheets and image bodies just like a usual web browser does?

Fluffy

4 Answers

6

What I'm doing right now is:

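import java.net.URL;
import java.util.HashMap;
import java.util.List;
import com.gargoylesoftware.htmlunit.WebRequestSettings;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// Accept headers per element type, modeled on what a desktop browser sends.
public static final HashMap<String, String> acceptTypes = new HashMap<String, String>(){{
        put("html", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        put("img", "image/png,image/*;q=0.8,*/*;q=0.5");
        put("script", "*/*");
        put("style", "text/css,*/*;q=0.1");
        // Stylesheets matched below are <link> elements, so the lookup key is "link".
        put("link", "text/css,*/*;q=0.1");
    }};

protected void downloadCssAndImages(HtmlPage page) {
        // Matches every <img>, plus every <link> whose type is text/css.
        String xPathExpression = "//*[name() = 'img' or (name() = 'link' and @type = 'text/css')]";
        List<?> resultList = page.getByXPath(xPathExpression);
        String referer = page.getWebResponse().getRequestSettings().getUrl().toString();

        try {
            for (Object node : resultList) {
                try {
                    HtmlElement el = (HtmlElement) node;

                    // Images carry their location in src, stylesheets in href.
                    String path = el.getAttribute("src").equals("") ? el.getAttribute("href") : el.getAttribute("src");
                    if (path == null || path.equals("")) continue;

                    URL url = page.getFullyQualifiedUrl(path);

                    WebRequestSettings wrs = new WebRequestSettings(url);
                    wrs.setAdditionalHeader("Referer", referer);

                    client.addRequestHeader("Accept", acceptTypes.get(el.getTagName().toLowerCase()));
                    client.getPage(wrs);
                } catch (Exception e) {
                    // One failed resource should not abort the rest; report it and continue.
                    e.printStackTrace();
                }
            }
        } finally {
            // Always restore the client's default Accept header.
            client.removeRequestHeader("Accept");
        }
}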
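
The two pure-Java pieces of the approach above (choosing an Accept header by tag name and resolving a relative src/href against the page URL) can be sketched without HtmlUnit. This is only an illustration: the class `ResourceRequestHelper`, its method names, and the example.com URL are made up for this sketch and are not part of HtmlUnit.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper for illustration; not part of HtmlUnit. */
class ResourceRequestHelper {
    private static final Map<String, String> ACCEPT_BY_TAG = new HashMap<>();
    static {
        ACCEPT_BY_TAG.put("img", "image/png,image/*;q=0.8,*/*;q=0.5");
        ACCEPT_BY_TAG.put("script", "*/*");
        // Stylesheets are referenced from <link> elements, so key on "link".
        ACCEPT_BY_TAG.put("link", "text/css,*/*;q=0.1");
    }

    /** Returns the Accept header for a tag name, falling back to the wildcard type. */
    static String acceptFor(String tagName) {
        return ACCEPT_BY_TAG.getOrDefault(tagName.toLowerCase(Locale.ROOT), "*/*");
    }

    /** Resolves a possibly relative src/href value against the page URI. */
    static URI resolve(URI pageUri, String path) {
        return pageUri.resolve(path);
    }

    public static void main(String[] args) {
        URI page = URI.create("http://example.com/docs/index.html");
        System.out.println(resolve(page, "../css/site.css"));
        System.out.println(acceptFor("LINK"));
    }
}
```

The fallback to `*/*` means an unexpected tag name never produces a null header value.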
Fluffy
1

Source: How to get base64 encoded contents for an ImageReader?

HtmlImage img = (HtmlImage) p.getByXPath("//img").get(3);  // fourth <img> on the page
ImageReader imageReader = img.getImageReader();
BufferedImage bufferedImage = imageReader.read(0);
String formatName = imageReader.getFormatName();
ByteArrayOutputStream byteaOutput = new ByteArrayOutputStream();
Base64OutputStream base64Output = new Base64OutputStream(byteaOutput);  // Apache Commons Codec
ImageIO.write(bufferedImage, formatName, base64Output);
base64Output.close();  // flush the final Base64 block before reading the bytes
String base64 = new String(byteaOutput.toByteArray());
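
If Commons Codec is not available, the same encoding step can be sketched with the JDK's own `java.util.Base64` (Java 8+). This swaps in a different encoder than the snippet above uses; the class `ImageBase64` and its methods are hypothetical names for this sketch, and the `BufferedImage` is assumed to have been obtained already (for example from the `ImageReader` above).

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Base64;
import javax.imageio.ImageIO;

/** Hypothetical helper for illustration; uses only JDK classes. */
class ImageBase64 {
    /** Serializes the image in the given format and returns it as a Base64 string. */
    static String encode(BufferedImage image, String formatName) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // ImageIO.write returns false when no writer supports the format.
        if (!ImageIO.write(image, formatName, bytes)) {
            throw new IOException("No ImageIO writer for format: " + formatName);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    /** Decodes a Base64 string back into an image (null if no reader recognizes it). */
    static BufferedImage decode(String base64) throws IOException {
        byte[] raw = Base64.getDecoder().decode(base64);
        return ImageIO.read(new ByteArrayInputStream(raw));
    }
}
```

Encoding the whole byte array at once avoids having to remember to close a wrapping Base64 stream before reading the result.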
Community
jer
1

Here's what I came up with:

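public InputStream httpGetLowLevel(URL url) throws IOException
{
    WebRequest wrq = new WebRequest(url);

    ProxyConfig config = webClient.getProxyConfig();

    // Route the request through the same proxy as the WebClient
    wrq.setProxyHost(config.getProxyHost());
    wrq.setProxyPort(config.getProxyPort());
    wrq.setCredentials(webClient.getCredentialsProvider().getCredentials(new AuthScope(config.getProxyHost(), config.getProxyPort())));

    // Send all matching cookies in one Cookie header; calling
    // setAdditionalHeader once per cookie would keep only the last one.
    StringBuilder cookieHeader = new StringBuilder();
    for (Cookie c : webClient.getCookieManager().getCookies(url)) {
        if (cookieHeader.length() > 0) cookieHeader.append("; ");
        cookieHeader.append(c.getName()).append('=').append(c.getValue());
    }
    if (cookieHeader.length() > 0) {
        wrq.setAdditionalHeader("Cookie", cookieHeader.toString());
    }

    // Skip page loading and ask the underlying connection for the raw response
    WebResponse wr = webClient.getWebConnection().getResponse(wrq);
    return wr.getContentAsStream();
}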

In my tests this approach works through a proxy, sends the WebClient's cookies with the request, and stores any new cookies the server sets in the response back into the WebClient.
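
Once a resource is available as a stream, such as the one `httpGetLowLevel` returns, it can be checked by size or hash. The following is a minimal sketch using only JDK classes; `StreamFingerprint` is a hypothetical name, and SHA-256 is just one reasonable choice of digest.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper for illustration: byte count and SHA-256 of a stream. */
class StreamFingerprint {
    final long size;
    final String sha256Hex;

    private StreamFingerprint(long size, String sha256Hex) {
        this.size = size;
        this.sha256Hex = sha256Hex;
    }

    /** Reads the stream to its end, closes it, and returns its size and digest. */
    static StreamFingerprint of(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] buffer = new byte[8192];
        long total = 0;
        int read;
        try (InputStream stream = in) {
            while ((read = stream.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
                total += read;
            }
        }
        // Render the digest as lowercase hex, two characters per byte.
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return new StreamFingerprint(total, hex.toString());
    }
}
```

A call such as `StreamFingerprint.of(httpGetLowLevel(url))` would then give values that can be compared against an expected size or hash.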

Arsen Zahray
0

HtmlUnit does not download CSS or images. They are useless to a headless browser...

The last discussion of this that I know of is here, but the ticket is marked private: http://osdir.com/ml/java.htmlunit.devel/2007-01/msg00021.html

bshouse
    What if the user wants to check the CSS or images with a headless browser? That seems to be what the question implies, so CSS and images wouldn't be useless. In fact, that's what led me to this question: it would be nice to use a headless browser to check an image by size or hash, or a stylesheet for the value of a background color. Trying to help here: your answer comes off as a little argumentative rather than constructive. – fooMonster Sep 15 '11 at 12:51