RegEx to match text up to the first occurrence of a delimiter

Question

This is the data I want to match with RegEx:

<table>
  <tr>
    <td>
      <font size="4">Speciality</font>
    </td>
    <td>
      <font size="4">somespeciality</font>
    </td>
  </tr>
  <tr>
    <td>
      <font size="4">Date</font>
    </td>
    <td>
      <font size="4">somedate</font>
    </td>
  </tr>
</table>

I want to get somespeciality as the result, but with this RegEx:

/Speciality[\s\S]*size="4">(.*?)<\/font>/i

I'm getting somedate. What is the correct way to do this?

Thanks.

Incidentally, and unrelated to your problem, the [`<font>` element](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/font) is deprecated and obsolete. Please use CSS instead. — David Thomas, Feb 18 '15 at 00:04
Please see [*RegEx match open tags…*](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). — RobG, Feb 18 '15 at 00:10
@DavidThomas Actually, I'm dealing with data on an old website, trying to extract it and then store it in a SQLite database. — pr.nizar, Feb 18 '15 at 00:29
@RobG Yep this post crossed my path a few times.. lol but really I'm compelled to do so.. DOM traversal and other techniques would be more complicated to implement for me. :-) — pr.nizar, Feb 18 '15 at 00:29

score 1 · Accepted Answer · answered Feb 17 '15 at 23:58

You need to use a non-greedy quantifier after your character class.

[\s\S]*?
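
To illustrate the difference, here is a minimal sketch using the markup from the question flattened into one string. The variable names (`html`, `greedy`, `lazy`, `greedyResult`, `lazyResult`) are only for illustration. The greedy `[\s\S]*` consumes the rest of the input and backtracks just far enough to find a `size="4">`, which is the last one; the lazy `[\s\S]*?` expands one character at a time and stops at the first one after "Speciality".

```javascript
// Sample markup from the question, flattened into a single string.
var html =
  '<table><tr><td><font size="4">Speciality</font></td>' +
  '<td><font size="4">somespeciality</font></td></tr>' +
  '<tr><td><font size="4">Date</font></td>' +
  '<td><font size="4">somedate</font></td></tr></table>';

// Greedy: [\s\S]* runs to the end of the input, then backtracks to the
// LAST size="4"> it can find.
var greedy = /Speciality[\s\S]*size="4">(.*?)<\/font>/i;

// Lazy: [\s\S]*? grows only as far as needed, so it stops at the FIRST
// size="4"> that follows "Speciality".
var lazy = /Speciality[\s\S]*?size="4">(.*?)<\/font>/i;

var greedyResult = html.match(greedy)[1];
var lazyResult = html.match(lazy)[1];

console.log(greedyResult); // somedate
console.log(lazyResult);   // somespeciality
```

Note that `match` returns `null` when nothing matches, so real code should check the result before indexing into it.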

hwnd


score 1 · Answer 2 · answered Feb 19 '15 at 00:29

Just for the record, if you did want to do this with plain DOM methods, you'd do something like the following. It gets all the elements, finds the first one whose trimmed text content equals the given text, notes its tag name, then finds the next element with that tag name and returns its text content:

var data = '<table><tr><td><font size="4">Speciality</font></td>' +
           '<td><font size="4">somespeciality</font></td></tr>' +
           '<tr><td><font size="4">Date</font></td><td><font size="4">' +
           'somedate</font></td></tr></table>';

function getSpecial(text, data) {
  var div = document.createElement('div');
  div.innerHTML = data;
  var tagName;

  var nodes = div.getElementsByTagName('*');

  for (var i=0, iLen=nodes.length; i<iLen; i++) {
    if (tagName && nodes[i].tagName == tagName) {
      return nodes[i].textContent;
    }

    if (nodes[i].textContent.trim() == text) {
      tagName = nodes[i].tagName;
    }
  }
}

console.log(getSpecial('Speciality', data)); // somespeciality

The difficulty with any such approach (including using a regular expression) is that any change to the markup (and resulting DOM) will likely cause the process to fail.

Note that the above requires ES5 and support for textContent, which should be all modern browsers and IE 9+. Support for older browsers can be added by adding a polyfill for trim and using nodes[i].textContent || nodes[i].innerText. The rest will be fine.
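
The fallback described above could be sketched roughly as follows. The helper names `trimText` and `getText` are hypothetical (they are not part of the answer's code), and the regular expression is a simplified stand-in for a full `trim` polyfill.

```javascript
// Hypothetical helper: use the native trim when available, otherwise
// strip leading and trailing whitespace with a regular expression.
function trimText(s) {
  return typeof s.trim === 'function'
    ? s.trim()
    : s.replace(/^\s+|\s+$/g, '');
}

// Hypothetical helper: prefer textContent, fall back to innerText for
// older browsers, and treat a missing value as an empty string.
function getText(node) {
  var raw = node.textContent || node.innerText || '';
  return trimText(raw);
}
```

With these in place, the comparison inside the loop would become `getText(nodes[i]) == text`, and the return value would use `getText` in the same way.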

Thank you! That's a correct answer to my problem too. But will this perform as well? (I'm loading over 12000 pages one by one with ajax, matching 15 strings, with this code http://pastebin.com/raw.php?i=57kPXWNA with XSS in a webkit) — pr.nizar, Feb 19 '15 at 09:53
