I made a regular expression that matches for title="..." in <a>; unfortunately, it also matches for title="..." in <img/>.

Is there a way to tell the regular expression to ONLY look for title="..." in <a>? I can't use a look-behind method like (?<=<a\s+) because they're NOT supported in JavaScript.

Here's my expression:

/((title=".+")(?=\s*href))|(title=".+")/igm;

The above expression matches the following:

[screenshot of the matches]

As you can see, it matches for title="..." found in <img/>; I need the expression to exclude titles found in image tags.

Here is the link to the RegExp.


Also, if possible, I need to get rid of the title=" " around the title. So, only return title AFTER href and title BEFORE href. If not possible, I guess I can use .replace() and replace it with "".


zx81's expression:

[screenshot of the matches]

Matthew
  • Every time I see someone use regular expressions to match strings in a non-regular language I die a little inside. – Aadit M Shah Jun 30 '14 at 04:03
  • @AaditMShah - This is in a textarea, it's not in the DOM as HTML elements on a webpage. Therefore, that principle doesn't apply, right? The user types in the HTML, so basically, it's just text with greater and less than signs that appear to be HTML – Matthew Jun 30 '14 at 04:08
  • @Matthew It's possible to convert a string into DOM objects, though. See http://stackoverflow.com/questions/494143/creating-a-new-dom-element-from-an-html-string-using-built-in-dom-methods-or-pro – Kemal Fadillah Jun 30 '14 at 04:20
  • If the user enters the HTML in a textarea, then.. *valid HTML* is single quotes, double quotes and *no quotes* (for attributes/properties without spaces). The order of title vs href (or other attributes/properties like id, class, name, alt, etc.) usually also doesn't matter.. (Something to be aware of if and when to instruct your users to use a specific formatting/order so your app considers it valid html or performs the intended operation). So.. user entering html is actually the worst possible scenario to regex! – GitaarLAB Jun 30 '14 at 05:11
  • @GitaarLAB I'm making an HTML-to-Markdown converter – Matthew Jun 30 '14 at 05:14
  • That's going to eat a big chunk of your life (without better solutions/libraries). With the best possible intentions (knowing what you are trying to do): might I suggest googling a little further for proper HTML-parsing libraries? I can't suggest one off the top of my head, sorry, but I am absolutely positive there are some solid working libraries and solutions. – GitaarLAB Jun 30 '14 at 05:21
  • Whether you write HTML in a textarea or display it as a page, HTML is still not a regular language. You can't parse non-regular languages using regular expressions. You need a full-blown HTML parser. – Aadit M Shah Jun 30 '14 at 06:18

3 Answers

2

First of all, you should know that most people prefer to parse HTML with a DOM parser, as regex can present certain hazards. That being said, for this straightforward task (no nesting), here is what you can do in regex.

Use Capture Groups

We don't have lookbehinds or \K in JavaScript, but we can capture what we like to a capture group, then retrieve the match from that group, ignoring the rest.

This regex captures the title to Group 1:

<a [^>]*?(title="[^"]*")

On the demo, look at the Group 1 captures in the right pane: that's what we are interested in.

Sample JavaScript Code

var unique_results = []; 
var yourString = 'your_test_string';
var myregex = /<a [^>]*?(title="[^"]*")/g;
var thematch = myregex.exec(yourString);
while (thematch != null) {
    // is it unique?
    if(unique_results.indexOf(thematch[1]) <0) {
        // add it to array of unique results
        unique_results.push(thematch[1]);
        document.write(thematch[1],"<br />");    
    }
    // match the next one
    thematch = myregex.exec(yourString);
}

Explanation

  • <a matches the beginning of the tag
  • [^>]*? lazily matches any chars that are not a >, up to...
  • ( capture group
  • title=" literal chars
  • [^"]* any chars that are not a quote
  • " closing quote
  • ) end Group 1
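
To return only the title text, without the surrounding title="...", the capture group can be moved inside the quotes. The following is a minimal sketch rather than a definitive implementation: the function name extractAnchorTitles is made up for illustration, and it assumes double-quoted attribute values with no > inside them, as in the question's sample. It does not care whether title comes before or after href.

```javascript
// Hypothetical helper (not part of the original answer): collects the bare
// title text of every <a ...> tag, wherever title sits relative to href.
// Assumes double-quoted attribute values, as in the question.
function extractAnchorTitles(html) {
    // <a\b       start of an anchor tag (\b keeps <abbr>, <area> etc. out)
    // [^>]*?     lazily skip other attributes without leaving the tag
    // \stitle="  the attribute name, preceded by whitespace
    // ([^"]*)    Group 1: the title text only
    var re = /<a\b[^>]*?\stitle="([^"]*)"/gi;
    var titles = [];
    var m;
    while ((m = re.exec(html)) !== null) {
        titles.push(m[1]);
    }
    return titles;
}
```

Because the pattern never crosses a >, a title that belongs to an <img/> nested inside a link is not picked up.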
zx81
  • Added explanation... Let me know if you have questions or need any tweaks. :) – zx81 Jun 30 '14 at 02:54
  • Thank you for helping me out! I updated my question w/ a screenshot of the regular expression you gave me, and it doesn't seem to work correctly? Am I implementing it wrong? – Matthew Jun 30 '14 at 03:43
  • `Am I implementing it wrong?` I'm assuming you want to implement it in code, right? This screenshot just shows the entire match. But we're only interested in Group 1, which the code returns. This is a technique for JS as it doesn't have lookarounds. On [this demo](http://regex101.com/r/aL5dV5/1), you can see the change of color for Group 1, and see the titles in the pane on the right. That's what the code will give you. :) – zx81 Jun 30 '14 at 03:46
  • What about [this](http://regex101.com/r/aL5dV5/3) demo? It only returns one match. How about when the title comes before? – Matthew Jun 30 '14 at 04:11
  • In regex101 you need to add the `g` flag to allow multiple matches: [see this demo](http://regex101.com/r/aL5dV5/4) ` How about when the title comes before?` What do you mean? Not understanding. – zx81 Jun 30 '14 at 04:20
1

I am not sure if you can do this with a single regular expression in JavaScript; however, you could do something like this:

http://jsfiddle.net/KYfKT/1/

var str = '\
<a href="www.google.com" title="some title">\
<a href="www.google.com" title="some other title">\
<a href="www.google.com">\
<img href="www.google.com" title="some title">\
';

var matches = [];
//-- somewhat hacky use of .replace() in order to utilize the callback on each <a> tag
//-- (<a\s rather than <a, so that tags such as <abbr> are not treated as links)
str.replace(/<a\s[^>]*>/gi, function (match) {
    //-- if the <a> tag includes a title, push its value onto matches;
    //-- [^"]* stops at the closing quote, so attributes after the title are never swallowed
    var title = match.match(/\stitle="([^"]*)"/i);
    title && matches.push(title[1]);
});

document.body.innerText = JSON.stringify(matches);

You should probably utilize the DOM for this, rather than regular expressions:

http://jsfiddle.net/KYfKT/3/

var str = '\
<a href="www.google.com" title="some title">Some Text</a>\
<a href="www.google.com" title="some other title">Some Text</a>\
<a href="www.google.com">Some Text</a>\
<img href="www.google.com" title="some title"/>\
';

var div = document.createElement('div');
div.innerHTML = str;
var titles = Array.prototype.slice.call(div.querySelectorAll('a[title]')).map(function (item) { return item.title; });

document.body.innerText = titles;
Robert Messerle
1

I'm not sure where your html-sources come from, but I do know some browsers do not respect the casing (or attribute-order) of source when fetched as 'innerHTML'.

Also, both authors and browsers can use single and double quotes.
These are the two most common cross-browser pitfalls that I know of.

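Thus, you could try: /<a [^>]*?title=(['"])([\s\S]*?)\1/gi

It performs a non-greedy, case-insensitive search and uses a back-reference to solve the case of single vs double quotes.

The first part is already explained by zx81's answer. \1 matches whatever the first capturing group matched, thus it matches the opening quote that was actually used. The second capturing group lazily takes everything up to that same quote, so it should contain the bare title-string. (A back-reference cannot be used inside a character class: in [^\1] the \1 is read as the control character U+0001 rather than as Group 1. That is why [\s\S] stands in for "any character" here and the trailing \1 does the actual work.)

A simple example:

var rxp=/<a [^>]*?title=(['"])([\s\S]*?)\1/gi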
,   res=[]
,   tmp
;

while( tmp=rxp.exec(str) ){  // str is your string
  res.push( tmp[2] );        //example of adding the strings to an array.
}

However as pointed out by others, it really is bad (in general) to regex tag-soup (aka HTML). Robert Messerle's alternative (using the DOM) is preferable!

Warning (I almost forgot)..
IE6 (and others?) has this nice 'memory-saving feature' of conveniently deleting all quotes it considers unneeded (around attribute values that contain no spaces). So, there, this regex (and zx81's) would fail, since they rely on the use of quotes! Back to the drawing board... (a seemingly never-ending process when regexing HTML).
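
One way to cope with missing quotes is to give each quoting style its own alternative. This is only a rough sketch under stated assumptions: the function name extractAnchorTitlesAnyQuote is hypothetical, and the pattern still assumes that no attribute value before the title contains a >. It is not a substitute for a real parser.

```javascript
// Hypothetical helper: returns the title of each <a ...> tag whether the
// value is double-quoted, single-quoted, or not quoted at all.
function extractAnchorTitlesAnyQuote(html) {
    // Group 1: double-quoted value
    // Group 2: single-quoted value
    // Group 3: unquoted value (runs until whitespace, a quote, or the end of the tag)
    var re = /<a\b[^>]*?\stitle\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))/gi;
    var titles = [];
    var m;
    while ((m = re.exec(html)) !== null) {
        // Only one of the three groups takes part in any given match;
        // the other two are undefined.
        titles.push(m[1] !== undefined ? m[1] : (m[2] !== undefined ? m[2] : m[3]));
    }
    return titles;
}
```

The three alternatives replace the back-reference trick: each branch excludes only its own delimiter, so a double-quoted title may contain an apostrophe and vice versa.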

GitaarLAB