Regex help - include, but not show

Question

I have a this.responseText that is messy, and I am trying to pull out only the parts I need.

Here's the text:

<html>
<head><title>Index of /browserify-view/build/source/pic/</title></head>
<body bgcolor="white">
<h1>Index of /browserify-view/build/source/pic/</h1><hr><pre><a href="../">../</a>
<a href="wd0c9af04bbf54efc9a2f7ba766a6694f2421b1dc.png">wd0c9af04bbf54efc9a2f7ba766a6694f2421b1dc..&gt;</a> 22-Jul-2016 22:29               65180
<a href="thumb-wd20f381801bb51.png">thumb-wd20f381801bb51.png;</a> 22-Jul-2016 22:33               10779
</pre><hr></body>
</html>

How can I extract just the file names, like this:

wd0c9af04bbf54efc9a2f7ba766a6694f2421b1dc.png

thumb-wd20f381801bb51.png

^(?=.*(?:a href|.png|...)) – AmazingDayToday Aug 23 '16 at 18:38
Why am I getting minuses? What is wrong? – AmazingDayToday Aug 23 '16 at 18:47

score 0 · Answer 1 · edited May 23 '17 at 11:54

This is by far one of the best responses I've seen on this topic: RegEx match open tags except XHTML self-contained tags

If you just need something quick, I would use something like this (Python syntax):

<a[^>]+href="(?P<x>[^"]+)">

Just note, it's bad practice, and if this is going to run on a larger scale (anything besides just THIS HTML), I would recommend an HTML parser. It will save a lot of time in the long run.
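Since the question is about JavaScript, here is a rough sketch of the same pattern in that language. JavaScript writes a named group as (?<x>...) rather than Python's (?P<x>...). The sample string is a shortened copy of the listing from the question, and the names listing, anchorPattern and extractHrefs are made up for this illustration.

```javascript
// Shortened copy of the directory listing from the question.
const listing = [
  '<pre><a href="../">../</a>',
  '<a href="wd0c9af04bbf54efc9a2f7ba766a6694f2421b1dc.png">wd0c9af..&gt;</a> 22-Jul-2016 22:29 65180',
  '<a href="thumb-wd20f381801bb51.png">thumb-wd20f381801bb51.png;</a> 22-Jul-2016 22:33 10779',
  '</pre>',
].join('\n');

// Same pattern as above, with the named group in JavaScript syntax.
// The g flag is required by matchAll.
const anchorPattern = /<a[^>]+href="(?<x>[^"]+)">/g;

// Hypothetical helper: collect the named group "x" from every match.
function extractHrefs(text) {
  return Array.from(text.matchAll(anchorPattern), (m) => m.groups.x);
}

const hrefs = extractHrefs(listing);

// The pattern also picks up the parent-directory link, so drop it.
const files = hrefs.filter((href) => href !== '../');

console.log(files);
```

This only holds up while the markup keeps exactly this shape (double-quoted href as the last attribute), which is the reason for the warning about using a parser for anything larger.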

score -1 · Answer 2 · answered Aug 23 '16 at 18:48

In Ruby, you can do

str.scan(/(?<=<a href=").+?\.png/)

This will return an array:

["wd0c9af04bbf54efc9a2f7ba766a6694f2421b1dc.png", "thumb-wd20f381801bb51.png"]

To break down the regex:

/(?<=<a href=").+?\.png/

(?<=<a href=") is a positive lookbehind: it requires the string <a href=" to appear immediately before the match, without including it in the result.
.+? matches any character one or more times; the ? makes it lazy, so it matches as few characters as possible.
\.png matches the literal text .png
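scan is a Ruby method, so for the JavaScript in the question a rough equivalent is String.prototype.match with the g flag. The sketch below assumes an engine that supports lookbehind (ES2018 or later, for example a current Node.js or Chrome). The sample string is a shortened copy of the listing, and the function name scanPngHrefs is made up for this illustration.

```javascript
// Shortened copy of the directory listing from the question.
const responseText = [
  '<pre><a href="../">../</a>',
  '<a href="wd0c9af04bbf54efc9a2f7ba766a6694f2421b1dc.png">wd0c9af..&gt;</a> 22-Jul-2016 22:29 65180',
  '<a href="thumb-wd20f381801bb51.png">thumb-wd20f381801bb51.png;</a> 22-Jul-2016 22:33 10779',
  '</pre>',
].join('\n');

// Hypothetical helper that mimics Ruby's str.scan for this one pattern.
function scanPngHrefs(text) {
  // With the g flag, match returns every matched substring,
  // or null when nothing matches, hence the fallback to [].
  return text.match(/(?<=<a href=").+?\.png/g) || [];
}

const pngLinks = scanPngHrefs(responseText);
console.log(pngLinks);
```

Because . does not match a newline, the "../" link on the first line cannot run on into the next line and is skipped. That relies on each link sitting on its own line, as it does in this listing.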

davidhu


[Using Regex on HTML is a bad idea](https://stackoverflow.com/a/1732454). – Siguza Aug 23 '16 at 18:55
The question is about javascript – AmazingDayToday Aug 23 '16 at 22:19
oh sorry, but the regex will work in js. you can check out this [link](http://stackoverflow.com/questions/13895373/javascript-equivalent-of-rubys-stringscan) on how to implement the `scan` method in javascript. – davidhu Aug 23 '16 at 23:02

score -1 · Accepted Answer · answered Aug 23 '16 at 18:53

First of all, DO NOT do this with Regex!

Regex is NOT capable of parsing HTML!

Use the JavaScript DOMParser instead:

var parser = new DOMParser();
var doc = parser.parseFromString(this.responseText, 'text/html');

Then use the DOM API to get the elements you need:

var nodes = doc.querySelectorAll('a:not([href="../"])');

And finally, use Array.prototype.map to map the nodes to their href attributes:

// Can't use nodes.map here because nodes is a NodeList, not an array
var links = Array.prototype.map.call(nodes, function(element)
{
    // Can't use element.href here because we're in a different document
    return element.getAttribute('href');
});

If you put that all together:

var exampleResponseText = `<html>
<head><title>Index of /browserify-view/build/source/pic/</title></head>
<body bgcolor="white">
<h1>Index of /browserify-view/build/source/pic/</h1><hr><pre><a href="../">../</a>
<a href="wd0c9af04bbf54efc9a2f7ba766a6694f2421b1dc.png">wd0c9af04bbf54efc9a2f7ba766a6694f2421b1dc..&gt;</a> 22-Jul-2016 22:29               65180
<a href="thumb-wd20f381801bb51.png">thumb-wd20f381801bb51.png;</a> 22-Jul-2016 22:33               10779
</pre><hr></body>
</html>`;

var parser = new DOMParser();
var doc = parser.parseFromString(exampleResponseText, 'text/html');
var nodes = doc.querySelectorAll('a:not([href="../"])');
var links = Array.prototype.map.call(nodes, function(element)
{
    return element.getAttribute('href');
});

console.log(links);
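The hrefs collected this way are relative to the directory listing. If full URLs are needed, one option is to resolve them with the standard URL constructor. This is only a sketch: the host https://example.com is a placeholder (the real host is not given in the question), and relativeLinks and toAbsolute are names invented for the example.

```javascript
// Placeholder base URL: the path comes from the listing's title,
// the host is an assumption and must be replaced with the real one.
const baseUrl = 'https://example.com/browserify-view/build/source/pic/';

// The two relative hrefs produced by the snippet above.
const relativeLinks = [
  'wd0c9af04bbf54efc9a2f7ba766a6694f2421b1dc.png',
  'thumb-wd20f381801bb51.png',
];

// Hypothetical helper: resolve each relative href against the base URL.
function toAbsolute(hrefs, base) {
  return hrefs.map(function (href) {
    return new URL(href, base).href;
  });
}

const absoluteLinks = toAbsolute(relativeLinks, baseUrl);
console.log(absoluteLinks);
```

The trailing slash on the base matters: without it, the last path segment (pic) would be replaced by the file name instead of being kept as a directory.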

Million times amazing. Thanks a lot Siguza! – AmazingDayToday Aug 23 '16 at 18:58
