Regular expression to extract text from a string in html format

Question

I am currently getting an error response in HTML format. It is of type string.

"<!DOCTYPE html>\r\n
<html>
  <head>
    <title>Data already exists</title>
  </head>
</html>"

I want to retrieve the content inside the <title>; for the instance above, that is "Data already exists". Can anybody suggest an appropriate regular expression to capture that text?

Any help is appreciated!

I really appreciate everyone's suggestion and thanks for taking time to share the knowledge. You guys are awesome. — inspiringmyself, Aug 29 '12 at 14:07

João Silva · Accepted Answer · 2012-08-29T01:25:33.213

5

First, you can do it without regex, by creating a dummy element to inject the HTML:

var s = "your_html_string";
var dummy = document.createElement("div");
dummy.innerHTML = s;
var title = dummy.getElementsByTagName("title")[0].innerText;

But if you really insist on using regex:

var s = "your_html_string";
var title = s.match(/<title>([^<]+)<\/title>/)[1];

Here's a DEMO illustrating both approaches.
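A slightly more defensive variant of the regex approach could look like the sketch below. The helper name `extractTitle` is hypothetical (it is not part of the answer above); it tolerates attributes on the `<title>` tag, ignores case, and returns `null` instead of throwing when there is no title. It is still only meant for a response whose format is known in advance, not for general HTML parsing.

```javascript
// Hypothetical helper: return the text of the first <title> element in an HTML string.
function extractTitle(html) {
  // \b[^>]* allows attributes on the opening tag; [\s\S]*? lazily matches
  // any characters, including newlines, up to the first closing tag.
  var match = /<title\b[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  // exec() returns null when nothing matches, so guard before indexing.
  return match ? match[1].trim() : null;
}

var response = "<!DOCTYPE html>\r\n<html>\n  <head>\n    <title>Data already exists</title>\n  </head>\n</html>";
console.log(extractTitle(response));          // "Data already exists"
console.log(extractTitle("<p>no title</p>")); // null
```

Returning `null` lets the caller decide what to show when the response does not follow the expected format.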

edited Aug 29 '12 at 01:25

answered Aug 29 '12 at 01:19

João Silva

You don't need to use *getElementsByTagName*, there is a [document.title](http://dev.w3.org/html5/spec/single-page.html#document.title) property that is more convenient. Also, the [title element](http://dev.w3.org/html5/spec/single-page.html#the-title-element) can have attributes, so the regular expression needs to be more sophisticated (parsing HTML with a regular expression is generally a bad idea). – RobG Aug 29 '12 at 01:40
@RobG: I absolutely agree that parsing HTML with a regex is generally a bad idea; however, OP explicitly said that it was a response error that follows the above format. `document.title` will get the current document's title. Note that OP is not trying to parse the current document but a specific response message (probably from an `ajax` call). – João Silva Aug 29 '12 at 01:46
Hmm... One line of regex, or three lines of dummy element manipulation? One or three? I know which I'd choose. (I too agree that in a general sense parsing HTML with regex is not the way to go, but as you said João, for a specific case with a known format I think it is OK.) – nnnnnn Aug 29 '12 at 02:06
Yes, all good. The OP could use the response text to create a new document, then just use *document.title*. – RobG Aug 29 '12 at 02:10
I really appreciate everyone's suggestion and thanks for taking time to share the knowledge. You guys are awesome. – inspiringmyself Aug 29 '12 at 14:07

elclanrs · Answer 2 · 2012-09-15T10:08:59.153

2

The very basics of parsing HTML tags with a regex look like this: http://jsbin.com/oqivup/1/edit

var text = /<(title)>(.+)<\/\1>/.exec(html).pop();

But for more complicated stuff I would consider using a proper parser.
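To unpack the one-liner: group 1 captures the tag name, `\1` requires the closing tag to repeat it, and `.pop()` returns the last capture group, which is the text. The sketch below generalises that idea to any simple tag name. The helper `textOfTag` is hypothetical, and it swaps the greedy `.+` for a lazy `.+?` so that it stops at the first closing tag. Like the original, it assumes the tag has no attributes and the text contains no line breaks.

```javascript
// Hypothetical helper generalising the backreference idea to any simple tag name.
function textOfTag(html, tagName) {
  // Only accept plain tag names so nothing needs escaping in the pattern.
  if (!/^[a-z][a-z0-9]*$/i.test(tagName)) {
    throw new Error("Unsupported tag name: " + tagName);
  }
  // Group 1 captures the tag name and \1 requires the closing tag to repeat it.
  // (.+?) is lazy, so it stops at the first matching closing tag.
  var pattern = new RegExp("<(" + tagName + ")>(.+?)</\\1>", "i");
  var match = pattern.exec(html);
  // The last array element is the last capture group, i.e. the text.
  return match ? match.pop() : null;
}

var html = "<head><title>Data already exists</title></head>";
console.log(textOfTag(html, "title")); // "Data already exists"
```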

edited Sep 15 '12 at 10:08

answered Aug 29 '12 at 01:25

elclanrs

Given the response is already a string can't you skip the jQuery line? – nnnnnn Aug 29 '12 at 02:02
I really appreciate everyone's suggestion and thanks for taking time to share the knowledge. You guys are awesome. – inspiringmyself Aug 29 '12 at 14:08

score 1 · Answer 3 · answered Aug 29 '12 at 01:27

You could parse it using DOMParser():

var parser=new DOMParser(),
    doc=parser.parseFromString("<!DOCTYPE html><html><head><title>Data already exists</title></head></html>","text/html");

doc.title; /* "Data already exists" */
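Since `DOMParser` with "text/html" is not available everywhere, one possible way to combine it with the regex approach is sketched below. The helper name `titleFromResponse` is hypothetical, and the fallback regex is an assumption borrowed from the other answers in this thread rather than part of this answer.

```javascript
// Hypothetical helper: prefer DOMParser, fall back to a regex when it is unavailable.
function titleFromResponse(html) {
  if (typeof DOMParser !== "undefined") {
    try {
      var doc = new DOMParser().parseFromString(html, "text/html");
      // Guard against a missing result or an empty title before trusting it.
      if (doc && doc.title) {
        return doc.title;
      }
    } catch (e) {
      // Parsing "text/html" is not supported here; use the regex fallback below.
    }
  }
  // Fallback for environments without a usable DOMParser.
  var match = /<title\b[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return match ? match[1].trim() : null;
}

var response = "<!DOCTYPE html><html><head><title>Data already exists</title></head></html>";
console.log(titleFromResponse(response)); // "Data already exists"
```

The feature check keeps the same call working both in a browser and in an environment with no DOM at all, where only the regex branch runs.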

answered Aug 29 '12 at 01:27

Oriol

You probably need to use an `ActiveXObject` for IE < 9. – João Silva Aug 29 '12 at 01:32
And how can we use the `doc` variable with jQuery? – Dariush Jafari Aug 29 '12 at 01:32
@DariushJafari Do you mean `$(doc)`? – Oriol Aug 29 '12 at 01:33
Chrome 23 Canary doesn't parse HTML with `DOMParser` though. If the HTML string is XML-valid, you can always use the `application/xml` parsing for cross-browser parsing. – Fabrício Matté Aug 29 '12 at 01:34
@Oriol how do you select some elements of `doc`? `$('div.cc')` selects the current document elements. – Dariush Jafari Aug 29 '12 at 01:51
@DariushJafari Sorry I can't help you more, but I'm not an expert. In fact, I have never used that function, but I knew it and I thought it could be a good solution to your problem. If you want to know more, you should ask it on another question, sorry. – Oriol Aug 29 '12 at 01:54
Cool, but that will go belly–up in IE 9 and lower (maybe 10 too). I guess the code is from the [MDN DOM Parser](https://developer.mozilla.org/en-US/docs/DOM/DOMParser) article, which also has a more general solution. – RobG Aug 29 '12 at 02:34
@RobG No, the code is from w3schools (http://www.w3schools.com/dom/dom_loadxmldoc.asp). But your link is great. Does "DOMParser HTML extension for other browsers" work for all browsers? It says that "text/html parsing is natively supported", but in the "Browser compatibility" table it seems that it only works in Firefox... – Oriol Aug 29 '12 at 02:44
The MDN code doesn't work in IE (can't set innerHTML of HTML element and IE doesn't support text/html with `parseFromString`), see my answer to [How to create Document objects with JavaScript](http://stackoverflow.com/questions/8227612/how-to-create-document-objects-with-javascript/12172023#12172023). W3Schools is very ordinary, much better to reference appropriate specifications with MDN and MSDN for examples. – RobG Aug 29 '12 at 06:39
I really appreciate everyone's suggestion and thanks for taking time to share the knowledge. You guys are awesome. – inspiringmyself Aug 29 '12 at 14:09
