
I am trying to scrape the content of a web page using enlive's html-resource function, but I get a 403 response because the request does not come from a browser. I guess this can be overridden in Java (found answer here), but I would like to see a Clojure way to handle this issue. Perhaps it can be done by passing parameters to the html-resource function, but I have not found an example of how to do that or what needs to be passed. Any suggestion will be greatly appreciated.

Thanks.

Мitke
  • Probably you need something like clj-http or http-kit, which give you control over the connection: you can provide some settings, get the response, and feed it to (html-resource). – Chiron Sep 08 '13 at 12:38
  • html-resource is a multimethod to which you can pass a URL object: https://github.com/cgrand/enlive/blob/master/src/net/cgrand/enlive_html.clj#L112 This is a good point at which to set 'user-agent' on your URL connection object. – Chiron Sep 08 '13 at 12:44

1 Answer


Enlive's html-resource does not provide a way to override the default request properties. You can, as in the other answer you found, open the connection yourself and pass the resulting InputStream to html-resource.

Something like the following would handle it:

(with-open [inputstream (-> (java.net.URL. "http://www.example.com/")
                            .openConnection
                            (doto (.setRequestProperty "User-Agent"
                                                       "Mozilla/5.0 ..."))
                            .getContent)]
  (html-resource inputstream))

Although, it might look better split out into its own function.
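
As a rough sketch of what that split might look like: the namespace example.scrape and the function name fetch-resource below are made up for illustration, and the sketch calls .getInputStream rather than .getContent so that the value bound in with-open is always an InputStream. It assumes enlive is on the classpath.

```clojure
(ns example.scrape
  (:require [net.cgrand.enlive-html :as html])
  (:import (java.net URL)))

(defn fetch-resource
  "Hypothetical helper: opens a connection to url, sets the User-Agent
  request header to user-agent, and returns the nodes parsed by enlive."
  [url user-agent]
  (let [conn (doto (.openConnection (URL. url))
               ;; Must be set before the connection is actually used.
               (.setRequestProperty "User-Agent" user-agent))]
    ;; with-open closes the stream once html-resource has parsed it.
    (with-open [in (.getInputStream conn)]
      (html/html-resource in))))
```

It would then be called as (fetch-resource "http://www.example.com/" "Mozilla/5.0 ..."), and the result can be passed to html/select as usual. If the server still rejects the request, .getInputStream throws an IOException, so the caller may want to catch that.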

Jared314