17

I'm trying to test a string against a basic HTML pattern. Although I use the m (multiline) modifier, it only works when the string is a one-liner:

(re-find #"(?im)^<html>.*<body>.*</body>.*</html>" c)

Fails:

"<html>   <body>   sad   </body> 
     </html>"

Works:

"<html>   <body>   sad   </body>      </html>"

What am I doing wrong?

R X
    I'll just leave it here http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – mobyte Feb 22 '13 at 10:06

2 Answers

21

Disclaimer: I'm not a Clojure programmer, but I think this problem is independent of the language.

When multi-line mode is enabled, the interpretation of the caret ^ and the dollar $ changes like this: Instead of matching the beginning and end of the entire input string, they match the beginning and the end of each line in the input string. This is - as far as I can see - not what you want/need.

What you want is for your .*s to match newlines (which they don't do by default), and this can be done by enabling single-line mode (aka dot-all mode). So this means:

(re-find #"(?is)^<html>.*<body>.*</body>.*</html>" c)

You can also verify this on RegExr.
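To make the difference concrete, here is a minimal sketch that runs both patterns against the failing input from the question. The binding `c` is assumed to hold that two-line string; the exact whitespace is a guess at what the asker had.

```clojure
;; Assumed input: the two-line string from the question (whitespace approximated).
(def c "<html>   <body>   sad   </body> \n     </html>")

;; (?m) only changes ^ and $; the dot still stops at the newline,
;; so the pattern cannot reach </html> on the second line.
(re-find #"(?im)^<html>.*<body>.*</body>.*</html>" c)
;; => nil

;; (?s) lets the dot match line terminators, so the match spans both lines.
(re-find #"(?is)^<html>.*<body>.*</body>.*</html>" c)
;; => the whole string, newline included
```

Since the pattern has no capturing groups, re-find returns the matched text as a plain string rather than a vector.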

zb226
15

You need to use the (?s) "dotall mode" switch.

Example:

user=> (re-find #"\d{3}.\d{3}" "123\n456")    
nil

user=> (re-find #"(?s)\d{3}.\d{3}" "123\n456")
"123\n456"

The (?m) switch is deceptively named: it changes what the ^ and $ anchors do, allowing them to also match at the start and end of each line, respectively, which is not what you want.
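As a small illustration of what (?m) actually affects, the following sketch uses a made-up two-line string (the name `text` is hypothetical) and counts how many times ^ can anchor a match:

```clojure
;; Hypothetical sample input with two lines.
(def text "first\nsecond")

;; Without (?m), ^ matches only at the very start of the input.
(re-seq #"^\w+" text)
;; => ("first")

;; With (?m), ^ also matches right after each line terminator.
(re-seq #"(?m)^\w+" text)
;; => ("first" "second")
```

Neither form changes what the dot matches, which is why (?m) alone does not help with patterns that need to cross line breaks.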

Matt Fenwick
    Thanks Matt! Others might find http://nakkaya.com/2009/10/25/regular-expressions-in-clojure/ useful, too. – David J. Nov 28 '13 at 21:19