
I have a regex that matches a standard HTML structure:

<(.*)html(.*)>(.*)<head(.*)>(.*)</head>(.*)<body(.*)>(.*)<body(.*)>(.*)</body>

It works fine for the sites I generate with Node.js / Express / Jade.

However, when I try to match the following page, I get no match:

<HTML><HEAD>
<TITLE>IPWEBS - 400 Bad Request</TITLE>
</HEAD>
<BODY><H2>400 Bad Request</H2>
<P>The request generated an error response.</P>
</BODY>
</HTML>

Any idea where I've gone wrong? Case sensitivity is not the problem; I've already checked that.

UPDATE: The following updated regex still produces no match:

/i<(.*)html(.*)>(.*)<head(.*)>(.*)</head>(.*)<body(.*)>(.*)</body>(.*)</html>

(Sorry, I had already tested the new regex, but while experimenting with upper case I made some copy/paste errors ;))

COMMENT: I just want to test basic availability and correct HTML structure with jasmine-node under Node.js. I don't want to parse the DOM or walk through it. If anyone has a better idea, I'd be really happy to hear suggestions.

solick
  • You're opening 2 body tags, closing one, and missing the closing html tag... – RemyG Jul 07 '14 at 14:01
  • And in case you didn't know: http://stackoverflow.com/a/1732454/3620171 – mike_m Jul 07 '14 at 14:04
  • Use an HTML parser; this isn't really an ideal task for regex. – abc123 Jul 07 '14 at 14:05
  • @mike_m I just want to check the basic structure; I'm using this with Jasmine tests under Node.js. I don't want to parse or walk through the DOM. – solick Jul 07 '14 at 14:07
  • Another issue may be the dot metacharacter not matching newlines. So you may need to add the "s" flag (single line mode) to cause the dot to also match newline characters. – bloodyKnuckles Jul 07 '14 at 14:10
  • Do not forget to use the Singleline option, because the dot normally doesn't match the newline character `'\r'`, IIRC. – Alex Zhukovskiy Jul 07 '14 at 14:12
  • @AlexJoukovsky: The JavaScript regex flavor doesn't provide a Singleline/DOTALL mode. The most common workaround is to use `[\s\S]*` instead of `.*`. And the set of characters `.` doesn't match is `[\r\n\u2028\u2029]`. In JavaScript, that is; it varies from one flavor to the next. – Alan Moore Jul 07 '14 at 15:01
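
The newline point raised in the last few comments can be checked with a small sketch. The shortened sample string and the variable names are illustrative assumptions, not taken from the question. (Engines that implement ES2018 or later also accept an `s` flag that lets `.` match line terminators; the `[\s\S]` workaround works everywhere.)

```javascript
// Illustrative input: a head element whose tags sit on separate lines.
const html = '<HEAD>\n<TITLE>400 Bad Request</TITLE>\n</HEAD>';

// `.` stops at line terminators, so this pattern cannot span the newlines.
const dotPattern = /<head.*>.*<\/head>/i;

// `[\s\S]` matches any character at all, including \r and \n.
const anyCharPattern = /<head[\s\S]*>[\s\S]*<\/head>/i;

const dotMatches = dotPattern.test(html);
const anyCharMatches = anyCharPattern.test(html);

console.log(dotMatches);     // false
console.log(anyCharMatches); // true
```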

2 Answers


I'd say your regex is wrong; it should be:

<(.*)html(.*)>(.*)<head(.*)>(.*)</head>(.*)<body(.*)>(.*)</body>(.*)</html>
RemyG

You have a redundant body expression:

<(.*)html(.*)>(.*)<head(.*)>(.*)</head>(.*)<body(.*)>(.*)<body(.*)>(.*)</body>

It seems it should be:

<(.*)html(.*)>(.*)<head(.*)>(.*)</head>(.*)<body(.*)>(.*)</body>

You can append (.*)</html> for completeness and add the /i option to ignore case.

However, the use of ( and ) raises a question. If you just want to test whether your HTML strings match the regex or not, you don't need capture groups, and the following regex will do the same:

<html.*>.*<head.*>.*</head>.*<body.*>.*</body>.*</html>
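
As a rough sketch of how the pieces from this thread might fit together in Node.js: the helper below takes the ungrouped pattern, adds the `i` flag, and swaps `.*` for a lazy `[\s\S]*?` so the match can cross line breaks. The names `BASIC_HTML` and `hasBasicHtmlStructure` are hypothetical, and the pattern is an assumption about what "basic structure" should mean, not a real HTML validator.

```javascript
// Hypothetical pattern: html, head and body tags in order, any text between.
// [\s\S]*? is used instead of .* because `.` does not match newlines.
const BASIC_HTML = new RegExp(
  '<html[\\s\\S]*?>[\\s\\S]*?' +
  '<head[\\s\\S]*?>[\\s\\S]*?</head>[\\s\\S]*?' +
  '<body[\\s\\S]*?>[\\s\\S]*?</body>[\\s\\S]*?' +
  '</html>',
  'i' // ignore case, so <HTML> and <html> are treated alike
);

// Hypothetical helper: true when the string has the expected tag sequence.
function hasBasicHtmlStructure(html) {
  return BASIC_HTML.test(html);
}

// The error page from the question, joined with newlines.
const errorPage = [
  '<HTML><HEAD>',
  '<TITLE>IPWEBS - 400 Bad Request</TITLE>',
  '</HEAD>',
  '<BODY><H2>400 Bad Request</H2>',
  '<P>The request generated an error response.</P>',
  '</BODY>',
  '</HTML>'
].join('\n');

console.log(hasBasicHtmlStructure(errorPage));                    // true
console.log(hasBasicHtmlStructure('<html><body></body></html>')); // false
```

In a jasmine-node spec this could be used as `expect(hasBasicHtmlStructure(body)).toBe(true)`, where `body` is the fetched page. Note that `<head` would also match a tag such as `<header`, so the check stays a coarse smoke test.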
Kita