What's wrong with my regex?

Question

I have a string, trackingObj, which stores a large block of HTML. The HTML contains many escaped characters such as \r\n, for example: <div id=\"MainBox\">\r\n

I'd like to get the following content from that giant string:

<td id='theTrackInfo'><strong><span id='HeaderNum'>aaa</span><span id='HeaderFrom'> <br>bbb</span><span id='HeaderDes'> <br>ccc</span><span id='HeaderItem'> <br>ddd</span><span id='HeaderState'> <br>eee</span><span id='HeaderADate'><br>fff</span><span id='HeaderSign'><br>ggg</span><DIV id='HeaderExtra'> </DIV></strong></td>

I tried to append the whole string to the DOM using html(), but there are illegal characters in it, so I couldn't use jQuery to perform DOM manipulation.

So I'm thinking about using a plain regex to get what I need. I tried the following:

var Info = new RegExp("<td>\sid='theTrackInfo'>[\s\S]*?\/td>", "g");
var InfoHtml = theTrackInfo.exec(trackingObj);
console.log(InfoHtml);

I also tried:

var InfoHtml = trackingObj.match(/<td>\sid='theTrackInfo'>[\s\S]*?<\/td>/gi);
console.log(InfoHtml);

Neither works. What am I missing?

=================UPDATE==========================

Hi everyone, thank you for all of your answers.

I finally got it to work using DOMParser:

var parser = new DOMParser();
var html = parser.parseFromString(ProcessedStrings,"text/html");
var info = $(html).find("#theTrackInfo");
console.log($(info).html());

Someone may say jQuery should do the same thing. The problem is that trackingObj is retrieved via an ajax call, and when I try to append it to the DOM with jQuery's append method, the console reports: "Unexpected token ILLEGAL"

But I will still choose a regex answer as the correct answer for this question.

==================update 2============

Hi, I examined Tom Fenech's approach and it works for me too. The error was probably caused by trying to append the code to a div; it has nothing to do with jQuery itself.

Try to follow Daniel's advice, but in a quick look at your RE I saw a `>` at the end of the ` — Tomás Cot, Aug 05 '14 at 15:05
`\r` and `\n` are perfectly valid HTML--HTML ignores whitespace unless explicitly told not to. I think what you mean is that your regex doesn't understand that the contents should be treated as a single string. Look at this post for information on the "dotall" modifier: http://stackoverflow.com/questions/1068280/javascript-regex-multiline-flag-doesnt-work — Palpatim, Aug 05 '14 at 15:07
Second Daniel's comment + I'd opt to write my regex as a literal, `new RegExp("\s")` is not the same as `/\s/`: escaping is required. Anyway: parsing markup is the way to go [here's a crude example of how to do it](http://jsfiddle.net/d6cgf/) — Elias Van Ootegem, Aug 05 '14 at 15:11
There is no need to append the content to the DOM if you just want to extract some data from it. Using a parser is the way to go. If DOMParser can do it, then I'd be very surprised if `$.parseHTML` can't. If you've tried my answer and are having trouble with it, please let me know. — Tom Fenech, Aug 05 '14 at 16:18

score 1 · Accepted Answer · answered Aug 05 '14 at 15:32

If you're already using jQuery, you can just parse your string as HTML then extract the part that you're interested in:

var trackingObj = "<table><tbody><tr><td id='theTrackInfo'><strong><span id='HeaderNum'>aaa</span><span id='HeaderFrom'> <br>bbb</span><span id='HeaderDes'> <br>ccc</span><span id='HeaderItem'> <br>ddd</span><span id='HeaderState'> <br>eee</span><span id='HeaderADate'><br>fff</span><span id='HeaderSign'><br>ggg</span><DIV id='HeaderExtra'> </DIV></strong></td></tr></tbody></table>";

var html = $.parseHTML(trackingObj);
var td = $(html).find('#theTrackInfo').get()[0]; // get native DOM element
console.log(td.outerHTML);

answered Aug 05 '14 at 15:32

Tom Fenech


OP mentions he can't use jQ in his question, that might be why the question wasn't tagged jQuery, either – Elias Van Ootegem Aug 05 '14 at 15:39
I interpreted that differently; I thought that they were saying that the approach that they tried didn't work, not that they couldn't use jQuery at all. – Tom Fenech Aug 05 '14 at 15:41
Hi Tom, why do we put get()[0] at the end? – JavaScripter Aug 06 '14 at 12:30
@JavaScripter because `find()` returns a jQuery object even if only one element is matched. [`get()`](http://api.jquery.com/get/) returns the list of native DOM elements corresponding to the jQuery object and we want the first one (index `[0]`). You can also use `.get(0)` if you prefer. I remember reading somewhere that it's about 1 nanosecond slower though. Pick whichever you think is more readable :) – Tom Fenech Aug 06 '14 at 12:48

score 0 · Answer 2 · answered Aug 05 '14 at 15:13

A couple of issues:

First, the "id" is inside the opening tag... so the attempt you made should be:

var Info = new RegExp("<td\sid='theTrackInfo'>[\s\S]*?<\/td>", "g");
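
One caveat with the constructor form, as the comments on this answer point out: inside a string literal, JavaScript consumes the backslash in `\s` before `RegExp` ever sees it, so the pattern has to be written with doubled backslashes or as a regex literal. A minimal sketch of the difference, assuming a shortened sample string (`sample` and the variable names are made up for illustration and are not from the question):

```javascript
// Hypothetical, shortened stand-in for the real trackingObj.
var sample = "<tr>\r\n<td id='theTrackInfo'><strong>aaa</strong></td>\r\n</tr>";

// In a string literal "\s" collapses to "s", so this pattern is really
// <tdsid='theTrackInfo'>[sS]*?</td> and cannot match the sample.
var broken = new RegExp("<td\sid='theTrackInfo'>[\s\S]*?<\/td>");

// Doubling the backslashes keeps \s and \S intact in the pattern.
var escaped = new RegExp("<td\\sid='theTrackInfo'>[\\s\\S]*?</td>");

// A regex literal needs no extra escaping.
var literal = /<td\sid='theTrackInfo'>[\s\S]*?<\/td>/;

console.log(broken.test(sample));     // false
console.log(escaped.test(sample));    // true
console.log(literal.exec(sample)[0]); // <td id='theTrackInfo'><strong>aaa</strong></td>
```

The literal form is usually easier to read; the constructor is mainly useful when the pattern has to be built from a variable.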

Secondly, it won't get the correct data if you have another table embedded in that `<td>`, because the lazy match stops at the first `</td>` it finds.
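
To illustrate the nested-table problem, here is a small sketch with a made-up input (the `nested` string is hypothetical and not taken from the question). The lazy quantifier ends the match at the first closing tag, which belongs to the inner cell:

```javascript
// Hypothetical input: the target cell contains a nested table.
var nested = "<td id='theTrackInfo'><table><tr><td>inner</td></tr></table>outer</td>";

var pattern = /<td\sid='theTrackInfo'>[\s\S]*?<\/td>/;
var match = nested.match(pattern)[0];

// The match is cut off at the inner cell's closing tag,
// so the rest of the outer cell ("outer") is lost.
console.log(match);                        // <td id='theTrackInfo'><table><tr><td>inner</td>
console.log(match.indexOf("outer") === -1); // true
```

Switching to a greedy quantifier does not fix this in general, since it would then run on to the last `</td>` in the whole string.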

I would look at using a tool like Html Agility Pack to get what you are looking for.

answered Aug 05 '14 at 15:13

Jason Ellingson


Invalid regex + why recommend a third party tool to somebody who is using a language that, 99% of the time, has a DOM API at the ready? – Elias Van Ootegem Aug 05 '14 at 15:15
I guess I don't understand? The regex works on Regex Tester. It correctly grabs both the opening and closing TD tags. Also, only suggested a 3rd party solution because of the complex nature of unvetted HTML code. He didn't specify that the code was from a valid DOM source. – Jason Ellingson Aug 05 '14 at 15:33
Passing a string constant to the `RegExp` constructor requires any escaping `\` chars to be escaped: `new RegExp('foo\s')` is not the same as `/foo\s/`. It should be `new RegExp('foo\\s')`. Parsing markup doesn't require a valid DOM: parsing a fragment as XML is just fine, or do what jQ does: create an element, set its contents to the markup you want to parse, and perform the DOM operations there. Easy – Elias Van Ootegem Aug 05 '14 at 15:39
Ahh. I see what you mean. I was just trying to point out the extra ">" after the initial TD (as others are also now pointing out as well). I wasn't going for more than that. And as for fragments of XML, I agree that jQuery can do a lot of that kind of work (and very well), I just find that it can easily allow for unexpected results as you have little control over how strict it wants to be. Outside tools offer options like assuming appropriate closing tags, or not. Good discussion. Appreciate your input. – Jason Ellingson Aug 05 '14 at 15:50
thank you Jason for your explanation. Now I learnt how to use Regex correctly. – JavaScripter Aug 05 '14 at 16:23
@JavaScripter: It looks as if you are still using regex to parse markup. That isn't using regex correctly – Elias Van Ootegem Aug 06 '14 at 05:22
