Regular expression to get link text

Question

I'm stumped! I've googled and read and read and read and I'm sure there is something really dumb that I'm doing wrong. This is from a Greasemonkey script that I can't for the life of me get to initiate AND perform correctly. I'm trying to match this:

<a href="/browse/post/SOMETHING/">**SOMETHING** (1111)</a>

Here's what I'm using:

var titleRegex = new RegExp("<a href=\"/browse/post/\d*/\">(.*) \(");

I'm sure I'm missing some kind of escape characters? But I just can't figure it out so that Firefox doesn't error out.

I generated the regexp using http://regexpal.com/. In the Firefox error console I receive "unterminated parenthetical".

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — asawyer, Dec 27 '11 at 21:38
for ease of reading I always prefer literal regex, e.g. `"here is a string".match(/match me/i)` — tomfumb, Dec 27 '11 at 21:48
I'd be curious to learn more about using an XML parser to accomplish something like this. I'm basically trying to modify an existing script to accomplish what I need it to do -- do you have a good example of a greasemonkey script that does things like this the **right** way? — spazzed, Dec 27 '11 at 22:03

score 5 · Accepted Answer · answered Dec 27 '11 at 21:40

When building a regex from a string instead of a regex literal, you need to double the backslashes.

Then, \d* only matches digits. I'm assuming that SOMETHING is just a placeholder, but if that were to contain anything but digits, it would fail.

Also, you should be using (.*?) (lazy) instead of (.*) (greedy), or you might be matching too much. Perhaps ([^(]*) would be even better.

Hard to say, though, without knowing more about the actual text you're trying to match.
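
To make the greedy/lazy difference concrete, here is a small sketch. The input line with two links is invented for illustration and does not come from the question; the three patterns are written as regex literals only to keep the escaping readable.

```javascript
// Hypothetical input: two links on one line (made up for this illustration).
var line = '<a href="/browse/post/1/">First (10)</a> ' +
           '<a href="/browse/post/2/">Second (20)</a>';

var greedy  = /<a href="\/browse\/post\/\d*\/">(.*) \(/;
var lazy    = /<a href="\/browse\/post\/\d*\/">(.*?) \(/;
var negated = /<a href="\/browse\/post\/\d*\/">([^(]*) \(/;

var greedyTitle  = line.match(greedy)[1];
var lazyTitle    = line.match(lazy)[1];
var negatedTitle = line.match(negated)[1];

console.log(greedyTitle);  // runs on into the second link
console.log(lazyTitle);    // First
console.log(negatedTitle); // First
```

The greedy group backtracks from the end of the line, so it stops at the last " (" rather than the first one.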

All in all:

var titleRegex = new RegExp("<a href=\"/browse/post/\\d*/\">([^(]*) \\(");
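
As a quick check, the sketch below applies this pattern to a sample link and compares it with the same pattern written as a regex literal. The post id and title in the sample are made up, and `\d*` only matches here because the id is numeric.

```javascript
// Made-up sample in the shape shown in the question.
var sample = '<a href="/browse/post/12345/">Some Title (1111)</a>';

// String form: every regex backslash is doubled.
var fromString = new RegExp("<a href=\"/browse/post/\\d*/\">([^(]*) \\(");

// Literal form: no doubling, but "/" must be escaped because it delimits the literal.
var fromLiteral = /<a href="\/browse\/post\/\d*\/">([^(]*) \(/;

var titleFromString  = sample.match(fromString)[1];
var titleFromLiteral = sample.match(fromLiteral)[1];

console.log(titleFromString);  // Some Title
console.log(titleFromLiteral); // Some Title
```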

Tim Pietzcker

This seems to work perfectly. I'm still confused -- the first quotation mark in the string only requires a single backslash but the "(" at the end requires a double? What is the reason for this? – spazzed Dec 27 '11 at 21:58
`\"` escapes the quote character so you can use it in a string. \\ escapes the backslash so you can use it in a regex where `\(` escapes the parenthesis so it matches a literal `(` instead of opening a capturing group. – Tim Pietzcker Dec 27 '11 at 22:05
Because the first quotation mark in the string is escaped so that JavaScript interprets it as a quotation mark within the string literal. Regular expressions are happy to accept quotation marks, so it doesn't need to be escaped within the regex. The "(" at the end needs to be escaped within the regex, not the string, so the JavaScript string has to contain `\(`. But JavaScript eats one backslash when it reads a string literal, so to get that you need to write "\\(", which JavaScript turns into a string containing `\(` and feeds to the regex. – Mike Edwards Dec 27 '11 at 22:06
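
The two layers described in these comments can be inspected directly. This is only an illustrative sketch; the variable names are made up.

```javascript
// Layer 1: the string literal. JavaScript consumes one backslash per escape.
var oneSlash   = "\(";  // not a recognized escape: the backslash is dropped
var twoSlashes = "\\("; // "\\" yields one real backslash

console.log(oneSlash.length);   // 1
console.log(twoSlashes.length); // 2

// Layer 2: the regex. The two-character string \( matches a literal "(".
var literalParen = new RegExp(twoSlashes);
var matchesParen = literalParen.test("(1111)");

// The one-character string "(" is an unclosed group and does not compile.
var compileError = null;
try {
    new RegExp(oneSlash);
} catch (e) {
    compileError = e;
}

console.log(matchesParen);                        // true
console.log(compileError instanceof SyntaxError); // true
```

The failure in the second half is the same kind of error as the "unterminated parenthetical" message reported in the question.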

score 2 · Answer 2 · answered Dec 27 '11 at 21:39

Here's a simple fix:

/href=\".*?\">(.*?)\(/

imsky

Douglas · Answer 3 · 2011-12-27T22:50:37.617

The general idea is to take a string of HTML, parse it into a document (a tree of DOM elements), then traverse it to extract information.

If the link were:

<a href="/browse/post/something/"><b>something</b> else</a>

First traverse the tree to find the anchor tag, then:

anchor.textContent // returns "something else"

It is simple to extract the text from an element, even when there are other elements in the tree below it that also contain text. This is also more robust than the regex approach. Say someone added a class attribute to the anchor: the regex in the accepted answer would no longer match the anchor tag, but a traversal-based solution would still work.

In the simple case, you can create a div, set its innerHTML to your HTML string, then traverse it:

var html = '<a href="/browse/post/"><p>Lorem</p> <p>Ipsum</p></a>';
var div = document.createElement("div");
div.innerHTML = html;
var anchors = div.getElementsByTagName("a");
for (var i = 0; i < anchors.length; i++) {
    console.log(anchors[i].textContent);
}

A more sophisticated version of this is packaged in the jQuery(string) function.

var html = '<div><a href="/browse/post/"><p>Lorem</p> <p>Ipsum</p></a></div>';
jQuery(html).find("a").each(function() {
    console.log(jQuery(this).text());
});

Live example: http://jsfiddle.net/ygcFM/
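
For the link in the question, the extracted text would still end in the count, e.g. "SOMETHING (1111)". One possible way to drop it after the DOM traversal is a much smaller regex applied to the text alone. `stripTrailingCount` is a hypothetical helper name, and the assumption that the count is always a parenthesized number at the end of the text is based only on the single example in the question.

```javascript
// Hypothetical helper: removes a trailing "(digits)" group and the whitespace around it.
function stripTrailingCount(text) {
    return text.replace(/\s*\(\d+\)\s*$/, "");
}

console.log(stripTrailingCount("SOMETHING (1111)")); // SOMETHING
console.log(stripTrailingCount("No count here"));    // No count here
```

In the loops above it would be applied to the extracted text, for example `stripTrailingCount(anchors[i].textContent)`.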

Great response. Time for me to pick up a book on jquery and the DOM to try and learn this stuff. My javascript is "novice" at best. Also -- double thanks for the jsfiddle.net link! I've not seen that before.... great tool!! — spazzed, Dec 29 '11 at 20:36
