I'm trying to parse an HTML file and get all the href values inside it.

So far, the code I'm using is:

(map 
   #(println (str "Match: " %)) 
   (re-find #"(?sm)href=\"([a-zA-Z.:/]+)\"" str_response))

Here str_response is the string containing the HTML. According to my basic understanding of Clojure, that code should print a list of matches, but so far, no luck. It doesn't crash, but it doesn't match anything either. I've tried using re-seq instead of re-find, but with no luck. Any help?

Thanks!

Deleteman

3 Answers

It is generally thought that you cannot parse HTML with a regex (entertaining answer), though just finding all occurrences of one tag should be doable.

Once you figure out the proper regex, re-seq is the function you want to use:

user> (re-find #"aa" "aalkjkljaa")
"aa"
user> (re-seq #"aa" "aalkjkljaa")
("aa" "aa")

This does not crash for you because re-find returns nil when nothing matches, and map treats nil as an empty sequence, so it does nothing.
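To make the shape of the results concrete, here is a minimal sketch using the regex from the question. The sample-html string and the names href-re, first-match, no-match and hrefs are made up for illustration and stand in for str_response. When the pattern contains a capture group, each match comes back as a vector of the whole match followed by the group, so second picks out the URL.

```clojure
;; Hypothetical sample input, standing in for str_response.
(def sample-html
  "<a href=\"http://example.com/a\">A</a> <a href=\"/b.html\">B</a>")

;; The regex from the question; the capture group holds the URL.
(def href-re #"(?sm)href=\"([a-zA-Z.:/]+)\"")

;; re-find returns only the first match: a vector [whole-match group-1].
(def first-match (re-find href-re sample-html))

;; re-find returns nil when nothing matches.
(def no-match (re-find href-re "<p>no links here</p>"))

;; re-seq returns every match; second extracts the captured URL from each.
(def hrefs (map second (re-seq href-re sample-html)))
```

Note that mapping println over the result of re-find, as in the question, would walk the elements of that single vector (the whole match and then the group) rather than a list of matches.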

Arthur Ulfeldt
  • Well... not so doable if you wanted to do it right. Want to exclude non-XML text quoted as CDATA? Want to exclude tags which belong to a different namespace? Etc. – Charles Duffy Jun 05 '12 at 02:23
  • you are completely correct: I highly recommend the linked answer on this topic :) "the center cannot hold..." – Arthur Ulfeldt Jun 05 '12 at 02:38

This really looks like an HTML scraping problem, in which case I would advise using Enlive.

Something like this should work:

(ns test.foo
  (:require [net.cgrand.enlive-html :as html]))

(let [url (html/html-resource
           (java.net.URL. "http://www.nytimes.com"))]
  (map #(-> % :attrs :href) (html/select url [:a])))
Julien Chastang

I don't think there is anything wrong with your code. Perhaps str_response is the suspect. The following works with http://google.com and your regex:

(let [str_response (slurp "http://google.com")]
  (map #(println (str "Match: " %)) 
   (re-seq #"(?sm)href=\"([a-zA-Z.:/]+)\"" str_response)))

Note that re-find also works, though it only returns one match.
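One thing to watch for, offered as a guess rather than a diagnosis: map is lazy, so the println calls only run when something consumes the resulting sequence. The REPL does that when it prints the result, but the same form evaluated for its side effects inside a program may print nothing. A minimal sketch using doseq, which is eager, follows. The names sample-html, href-re and print-hrefs are hypothetical and not from the original code.

```clojure
;; Hypothetical sample input, standing in for str_response.
(def sample-html
  "<a href=\"http://example.com/a\">A</a> <a href=\"/b.html\">B</a>")

;; The regex from the question; the capture group holds the URL.
(def href-re #"(?sm)href=\"([a-zA-Z.:/]+)\"")

;; doseq walks the matches eagerly, so every println runs even when
;; nothing consumes a return value. Each match is [whole-match url].
(defn print-hrefs [html]
  (doseq [[_ url] (re-seq href-re html)]
    (println (str "Match: " url))))

(print-hrefs sample-html)
```

Wrapping the original map call in dorun would also force evaluation, but doseq states the intent (side effects only) more directly.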

jbear
  • Thanks for the answer, for some reason, that code inside my project didn't print anything, I've decided to go with Julien's solution anyways. Thanks for taking the time! – Deleteman Jun 05 '12 at 13:58
  • You're welcome. As far as parsing html is concerned Chris Grand's enlive is the way to go. – jbear Jun 05 '12 at 22:52