Regex to remove everything from a string except a specific range of characters

Question

I need help removing everything from an HTML document string with a JavaScript regex, except the <body></body> tags and everything inside the body tag.

I tried this, but it doesn't work:

var str = "<html><head><title></title></head><body>my content</body></html>"
str.replace(/[^\<body\>(.+)\<\\body\>]+/g,'');

I only need the body content. Another option would be to use DOMParser:

var oParser = new DOMParser(str);
var oDOM = oParser.parseFromString(str, "text/xml");

But this throws an error when parsing my document string, which is loaded via Ajax.
Thanks in advance for your suggestions!
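
A note on the DOMParser attempt above: a likely cause of the error is the "text/xml" type, which requires well-formed XML, something typical HTML is not. The sketch below assumes a browser environment and uses "text/html" instead; extractBodyWithParser is a hypothetical helper name, and the guard returns null where DOMParser is not defined (for example plain Node.js).

```javascript
// Sketch: read the inner HTML of <body> with DOMParser (a browser API).
// extractBodyWithParser is a hypothetical name, not from the question.
function extractBodyWithParser(html) {
  // DOMParser is not defined in every runtime, so guard for it.
  if (typeof DOMParser === "undefined") {
    return null;
  }
  // The constructor takes no arguments; the markup goes to parseFromString.
  var parser = new DOMParser();
  // "text/html" selects the forgiving HTML parser rather than the strict XML one.
  var doc = parser.parseFromString(html, "text/html");
  return doc.body.innerHTML;
}

var str = "<html><head><title></title></head><body>my content</body></html>";
console.log(extractBodyWithParser(str));
```

This returns only what is inside the body tag, without the tags themselves.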

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags @Marty — d'alar'cop, Aug 22 '14 at 02:43
@Marty Would you really recommend an HTML parser for something as simple as this? This is not an HTML parsing question; it's a simple matching question. — tckmn, Aug 22 '14 at 02:45
@Doorknob Actually that is exactly the reason I mentioned that those comments were coming rather than making it myself. — Marty, Aug 22 '14 at 02:46
@joseluisq: The trouble with all the answers that use `.*?` is that `.` doesn't match newline characters. Use `[\s\S]*?` instead. — squint, Aug 22 '14 at 03:01
@squint Yes, I tried `[\s\S]*?` and it works fine, thanks! — joseluisq, Aug 22 '14 at 03:05

Tim.Tang · Accepted Answer · 2014-08-22T03:05:30.187

1

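var str = "<html><head><title></title></head><body>my content</body></html>";

// With the g flag, match() returns an array of all matches (or null if there are none).
str = str.match(/<(body)>[\s\S]*?<\/\1>/gi);

// Alternatively, in engines that support the s (dotAll) flag (ES2018 and later):
// str = str.match(/<(body)>.*?<\/\1>/gis);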
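
Since the question asks for the body content only, a capture group can return just the text between the tags. This is a minimal sketch, not part of the original answer: extractBodyContent is a hypothetical helper name, and allowing attributes on the opening tag with `<body\b[^>]*>` is an added assumption.

```javascript
// Hypothetical helper; the name and the attribute handling are assumptions.
function extractBodyContent(html) {
  // <body\b[^>]*> also accepts attributes such as <body class="x">.
  // [\s\S]*? lazily matches any character, newlines included.
  var m = html.match(/<body\b[^>]*>([\s\S]*?)<\/body>/i);
  // Without the g flag, match() exposes capture groups; it returns null on no match.
  return m ? m[1] : null;
}

var html = "<html><head><title></title></head>\n<body class=\"x\">\nmy content\n</body></html>";
console.log(JSON.stringify(extractBodyContent(html))); // "\nmy content\n"
```

Dropping the g flag matters here: with g, match() returns only the full matches and the capture group is not available.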

edited Aug 22 '14 at 03:05

answered Aug 22 '14 at 02:43

Tim.Tang


1

@joseluisq see my updates: `str=str.match(/<(body)>[\s\S]*?<\/\1>/gi);` http://regex101.com/r/eJ6sG4/3 – Tim.Tang Aug 22 '14 at 02:55

Avinash Raj · Answer 2 · 2014-08-22T02:48:43.427

1

You could try this code:

> var str = "<html><head><title></title></head><body>my content</body></html>"
undefined
> str.replace(/.*?(<body>.*?<\/body>).*/g, '$1');
'<body>my content</body>'

edited Aug 22 '14 at 02:48

answered Aug 22 '14 at 02:43

Avinash Raj


1

@joseluisq try `str.replace(/[\S\s]*?(.*?<\/body>)[\S\s]*/gm, '$1');` – Avinash Raj Aug 22 '14 at 02:58

tckmn · Answer 3 · 2014-08-22T03:08:26.290

0

You can't (or at least shouldn't) do this with replace; try match instead:

var str = "<html><head><title></title></head><body>my content</body></html>"
var m = str.match(/<body>.*<\/body>/);
console.log(m[0]); //=> "<body>my content</body>"

If you have a multiline string, change the . (which does not include \n) to [\S\s] (not whitespace OR whitespace) or something similar.
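
The difference is easy to check on a string with newlines. The sample string below is made up for illustration:

```javascript
// A made-up multiline document for illustration.
var multiline = "<html>\n<body>\nline one\nline two\n</body>\n</html>";

// . does not match \n, so the pattern cannot span the lines between the tags.
var dotMatch = multiline.match(/<body>.*<\/body>/);
console.log(dotMatch); // null

// [\S\s] matches any character, newlines included.
var anyMatch = multiline.match(/<body>[\S\s]*<\/body>/);
console.log(anyMatch[0] === "<body>\nline one\nline two\n</body>"); // true
```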

edited Aug 22 '14 at 03:08

answered Aug 22 '14 at 02:43

tckmn


OK, so what can I do when the HTML string is more complex, as in http://regex101.com/r/eJ6sG4/1 ? – joseluisq Aug 22 '14 at 02:56
1

@joseluisq That string has newlines; you'll need the `s` flag (dotall) for that – tckmn Aug 22 '14 at 02:57
@Doorknob there isn't a `s` modifier in js. – Avinash Raj Aug 22 '14 at 02:59
@AvinashRaj Whoops, that's what I get for answering JS questions while coding in Python. :P I'll edit the answer – tckmn Aug 22 '14 at 03:07
