Getting all matches for a regexp in Clojure

Question

I'm trying to parse an HTML file and get all the hrefs inside it.

So far, the code I'm using is:

(map 
   #(println (str "Match: " %)) 
   (re-find #"(?sm)href=\"([a-zA-Z.:/]+)\"" str_response))

str_response is the string containing the HTML. According to my basic understanding of Clojure, that code should print a list of matches, but so far, no luck. It doesn't crash, but it doesn't match anything either. I've tried using re-seq instead of re-find, also with no luck. Any help?

Thanks!

If you include the value of str_response in your question, I can help with the regex. – Arthur Ulfeldt, Jun 04 '12 at 22:30

score 4 · Answer 1 · edited May 23 '17 at 12:28

It is generally thought that you cannot parse HTML with a regex (entertaining answer), though just finding all occurrences of one tag should be doable.

Once you figure out the proper regex, re-seq is the function you want to use:

user> (re-find #"aa" "aalkjkljaa")
"aa"
user> (re-seq #"aa" "aalkjkljaa")
("aa" "aa")

This is not crashing for you because re-find is returning nil, which map treats as an empty list, so it does nothing.
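
As a minimal sketch of that behaviour (the sample string and the names sample, href-re, no-match, first-match and all-matches are made up for illustration, not taken from the question), this shows what re-find and re-seq return when the regex has a capture group:

```clojure
;; Hypothetical sample input, not the asker's actual HTML.
(def sample "<a href=\"http://a.com/x\">A</a> <a href=\"http://b.org/\">B</a>")

;; The asker's pattern, minus the flags.
(def href-re #"href=\"([a-zA-Z.:/]+)\"")

;; No match: re-find returns nil, and map over nil yields an empty seq.
(def no-match (map identity (re-find href-re "no links here")))

;; A match with one capture group: re-find returns a vector
;; [whole-match group-1], so map would walk over those two strings.
(def first-match (re-find href-re sample))

;; re-seq returns one such vector per match.
(def all-matches (re-seq href-re sample))
```

So with a capturing regex, mapping directly over the result of re-find iterates over the pieces of a single match rather than over all matches.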

answered Jun 04 '12 at 22:28

Arthur Ulfeldt

Well... not so doable if you wanted to do it right. Want to exclude non-XML text quoted as CDATA? Want to exclude tags which belong to a different namespace? Etc. – Charles Duffy Jun 05 '12 at 02:23
you are completely correct: I highly recommend the linked answer on this topic :) "the center cannot hold..." – Arthur Ulfeldt Jun 05 '12 at 02:38

score 3 · Accepted Answer · answered Jun 04 '12 at 22:36

This really looks like an HTML scraping problem, in which case I would advise using enlive.

Something like this should work:

(ns test.foo
  (:require [net.cgrand.enlive-html :as html]))

;; Fetch and parse the page, then pull the :href attribute from every <a> node.
(let [page (html/html-resource
            (java.net.URL. "http://www.nytimes.com"))]
  (map #(-> % :attrs :href) (html/select page [:a])))

Julien Chastang

Thanks for the answer! It seems to be the most "elegant" one. – Deleteman Jun 05 '12 at 13:57

score 2 · Answer 3 · answered Jun 05 '12 at 02:02

I don't think there is anything wrong with your code. Perhaps str_response is the suspect. The following works with http://google.com with your regex:

;; Fetch the page as a string, then print every match found by re-seq.
(let [str_response (slurp "http://google.com")]
  (map #(println (str "Match: " %))
       (re-seq #"(?sm)href=\"([a-zA-Z.:/]+)\"" str_response)))

Note that re-find also works, though it only returns one match.
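
One possible reason the same code prints nothing outside the REPL is that map is lazy: its println calls only run when something consumes the resulting sequence, which the REPL does when it prints the result. A minimal sketch that avoids this, assuming a made-up sample string (sample-html and hrefs are hypothetical names), uses doseq for the printing and keeps only the captured URL from each match:

```clojure
;; Hypothetical sample input, standing in for str_response.
(def sample-html
  "<a href=\"http://a.com/x\">A</a> <a href=\"http://b.org/\">B</a>")

(defn hrefs
  "Returns the captured URL from every href match in html."
  [html]
  ;; Each re-seq element is [whole-match group-1]; second picks group-1.
  (map second (re-seq #"(?sm)href=\"([a-zA-Z.:/]+)\"" html)))

;; doseq is eager, so the printing happens even outside the REPL.
(doseq [h (hrefs sample-html)]
  (println (str "Match: " h)))
```

This keeps the question's regex as-is, so URLs containing digits, hyphens or query strings would still be missed.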

jbear

Thanks for the answer, for some reason, that code inside my project didn't print anything, I've decided to go with Julien's solution anyways. Thanks for taking the time! – Deleteman Jun 05 '12 at 13:58
You're welcome. As far as parsing html is concerned Chris Grand's enlive is the way to go. – jbear Jun 05 '12 at 22:52
