Regex to return all attributes of a web page that start with a specific value

Question

The question is simple: I need to get the value of every attribute whose value starts with `http://example.com/api/v3?`. For example, if a page contains

<iframe src="http://example.com/api/v3?download=example%2Forg">
<meta twitter="http://example.com/api/v3?return_to=%2F">

Then I should get an array/list with 2 members: `http://example.com/api/v3?return_to=%2F` and `http://example.com/api/v3?download=example%2Forg` (the order doesn’t matter).

I don’t want the elements, just the attribute’s value.
Basically I need the regex that returns strings starting with `http://example.com/api/v3?` and ending with a space.

I couldn’t find a way to use `querySelectorAll` to achieve this. — user2284570, Oct 02 '16 at 22:55
@KevinB : very bad idea. What if my web page has 500k elements *(lots of ads that I can’t prevent from loading)*? On Android you have either userscripts or ad blocking but not both. — user2284570, Oct 02 '16 at 22:57
then you'd have to loop through 500k elements. You've given us nothing useful to filter by. — Kevin B, Oct 02 '16 at 22:57
you could... regexp... but no one here will help you with that, and it would still involve reading all of the html. — Kevin B, Oct 02 '16 at 22:58
can narrow down the selectors to only attributes that could contain that value but otherwise you need a full dom search or more refined search criteria — charlietfl, Oct 02 '16 at 22:59
@KevinB Regexes only work with strings. So you would need to serialize the HTML, and then reparse it with regex. But [you can't parse \[X\]HTML with regex.](http://stackoverflow.com/a/1732454/1529630) — Oriol, Oct 02 '16 at 22:59
`var allElements = document.querySelectorAll("*")` and then `[...allElements].filter(el =>...` by `attributes` property for the rest. If you have many elements just throw it to a worker. — Redu, Oct 02 '16 at 23:00
@KevinB : `and it would still involve reading all of the html` But in C++. It wouldn’t stall my device for 1 min. — user2284570, Oct 02 '16 at 23:01
"very bad idea. What if my web page has 500k elements (lot of ads that I can’t prevent loading) ?". This is your problem. There is no magical tool for this. You have to iterate through all elements. — Ram, Oct 02 '16 at 23:02
@unlucky13 : I don’t think so when I see something like this http://stackoverflow.com/q/21975881/2284570 or this http://stackoverflow.com/a/8714421… — user2284570, Oct 02 '16 at 23:03
@Redu : Is there really no better querySelector? http://stackoverflow.com/q/21975881/2284570 http://stackoverflow.com/a/8714421 — user2284570, Oct 02 '16 at 23:04
There is no wildcard for attribute names in CSS attribute selectors. So `querySelector` is useless. You need to iterate manually. — Oriol, Oct 02 '16 at 23:04
There is a wildcard attribute, just not a wildcard attribute on elements. See the [selectors](https://developer.mozilla.org/en-US/docs/Web/Guide/CSS/Getting_started/Selectors) here. — bitten, Oct 02 '16 at 23:06
Those selectors are looking for _specific_ elements, your question implies that _all_ attributes of all elements should be considered. If the value can be found only on specific attributes of specific elements, then there are faster options. — Ram, Oct 02 '16 at 23:07
@bitten Please tell me which attribute selector accepts a wildcard for the attribute name. There isn't anything in the [spec](https://drafts.csswg.org/selectors-4/#attribute-selectors) — Oriol, Oct 02 '16 at 23:08

Ouroborus · Answer 1 · 2016-11-13T11:41:29.503

1

There is the CSS selector * meaning "any element".

There is no CSS selector meaning "any attribute with this value". Attribute names are arbitrary. While there are several attributes defined in the HTML specs, it's possible to use custom ones like the twitter attribute in your example. This means you'll have to iterate over all the attributes on a given element.

Without a global attribute value selector, you will need to manually iterate over all elements and values. It may be possible for you to determine some heuristics to help narrow down your search before going brute force.
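One possible heuristic, sketched here under the assumption that you know (or can guess) which attribute names carry the URLs: build a selector from the CSS `[attr^="value"]` prefix form for each of those names, so the browser does the filtering and you only loop over the matches. The helper names `buildPrefixSelector` and `collectValues` and the list of attribute names are made up for illustration, and the attribute names are assumed to be plain identifiers that need no escaping.

```javascript
// Hypothetical helper: build a selector that matches elements whose listed
// attributes start with `prefix`, using the CSS [attr^="value"] form.
function buildPrefixSelector(attrNames, prefix) {
  // Escape backslashes and double quotes so the prefix is a valid CSS string.
  const escaped = prefix.replace(/["\\]/g, "\\$&");
  return attrNames.map(name => `[${name}^="${escaped}"]`).join(", ");
}

// Hypothetical helper: collect the matching attribute values under `root`
// (anything that offers querySelectorAll, e.g. `document`).
function collectValues(root, attrNames, prefix) {
  const values = [];
  const selector = buildPrefixSelector(attrNames, prefix);
  for (const elem of root.querySelectorAll(selector)) {
    for (const name of attrNames) {
      const value = elem.getAttribute(name);
      if (value !== null && value.startsWith(prefix)) values.push(value);
    }
  }
  return values;
}

// In a browser (the attribute names are an assumption about the page):
// collectValues(document, ["src", "href", "twitter"], "http://example.com/api/v3?");
```

This only finds values on the attribute names you list; anything stored under an unexpected attribute name still needs the brute-force loop.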

edited Nov 13 '16 at 11:41

answered Oct 02 '16 at 23:07


`Without a global attribute value selector, you will need to manually iterate over all elements and values.` Is there really no regexp that can work with `.match` to be able to return the array I need? – user2284570 Oct 02 '16 at 23:11
@user2284570 yes you could probably do that since all you want are the values... take the whole page as html string...assuming you aren't trying to modify anything – charlietfl Oct 02 '16 at 23:13
@user2284570 If you mean to apply a regex against the entire document HTML, there probably is one, but I have doubts that it would be any more effective (and would certainly be less clear) than using other means to find the attributes you're looking for. – Ouroborus Oct 02 '16 at 23:15
@Ouroborus : Doesn’t `document.querySelectorAll()` perform regexp matching over the whole page? – user2284570 Oct 02 '16 at 23:17
@user2284570 No. It parses the selector and then applies each part of the selector to an internal version of the DOM. – Ouroborus Oct 02 '16 at 23:18
@user2284570 big difference between html as string and the DOM that contains element objects converted from the html – charlietfl Oct 02 '16 at 23:18
@charlietfl : I’d still like to try the RegExp way. – user2284570 Oct 02 '16 at 23:24
@user2284570 What makes you think iterating 500k elements manually will be slow, but using regex to parse a serialization of the same 500k elements will be fast? – Oriol Oct 02 '16 at 23:31
@user2284570 A regex just for pulling all attribute/value pairs is `/(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?/g`. Even that doesn't cover every case, matching against non-HTML text that's in the right format and having problems with certain combinations of spacing and quoting. – Ouroborus Oct 02 '16 at 23:51
@Oriol : Because the regex would be processed with C code ? – user2284570 Oct 03 '16 at 12:04
@Ouroborus : I know I’m processing ʜᴛᴍʟ5 pages with videos. There’s no xml on that site. But the aim is to download ᴊꜱᴏɴ links. – user2284570 Oct 03 '16 at 18:35

Sebastian Simon · Accepted Answer · 2016-10-02T23:24:47.263

A regular expression would likely look like this:

/http:\/\/example\.com\/api\/v3\?\S+/g

Make sure to escape each `/` and `?` with a backslash. `\S+` matches all subsequent non-whitespace characters. You can also try `[^\s"]+` instead of `\S+` if you also want to exclude quote marks; otherwise the match runs on through the closing quote of the attribute.
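As a minimal sketch of how that regex could be used with `.match`, assuming the page's HTML is available as a string: the `html` sample below is hypothetical input built from the two tags in the question, and the character class is a slightly stricter variant (it also stops at single quotes and `>`), which is my assumption rather than part of the regex above.

```javascript
// Hypothetical sample input. In a browser, the string could instead come
// from document.documentElement.outerHTML.
const html = [
  '<iframe src="http://example.com/api/v3?download=example%2Forg">',
  '<meta twitter="http://example.com/api/v3?return_to=%2F">',
].join("\n");

// Stop at whitespace, quotes or ">" so the closing quote is not captured.
const pattern = /http:\/\/example\.com\/api\/v3\?[^\s"'>]+/g;

// match() with a global regex returns every match, or null if there is none.
const urls = html.match(pattern) || [];

console.log(urls);
```

For the two sample tags this yields the two URLs from the question. Note that this matches the text anywhere in the markup, not only inside attribute values, and that serialized HTML writes `&` inside attribute values as `&amp;`, so URLs with several query parameters may need unescaping afterwards.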

In my experience, though, regexes are usually slower than working on already parsed objects directly, so I’d recommend you try these Array and DOM functions instead:

Get all elements, map each one to the list of its attributes whose value starts with `http://example.com/api/v3?`, reduce the per-element attribute lists to a single Array, and map those attributes to their values.

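// Collect every attribute value on the page that starts with the prefix.
Array.from(document.querySelectorAll("*"))
  // Per element: keep only the attributes whose value matches.
  .map(elem => Array.from(elem.attributes)
    .filter(attr => attr.value.startsWith("http://example.com/api/v3?")))
  // Flatten the per-element lists into one array.
  .reduce((list, attrList) => list.concat(attrList), [])
  .map(attr => attr.value);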

You can find polyfills for ES6 and ES5 functions and can use Babel or related tools to convert the code to ES5 (or replace the arrow functions by hand).
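If you would rather replace the ES6 parts by hand, a rough ES5-only equivalent might look like the sketch below. The function name `collectAttributeValues` and its `root` parameter are my own choices for illustration; `root` is assumed to be `document` or any other object that offers `querySelectorAll`.

```javascript
// ES5-only sketch: no arrow functions, no Array.from, no startsWith.
function collectAttributeValues(root, prefix) {
  var result = [];
  var elements = root.querySelectorAll("*");
  for (var i = 0; i < elements.length; i++) {
    var attrs = elements[i].attributes;
    for (var j = 0; j < attrs.length; j++) {
      // indexOf(...) === 0 stands in for the ES6 startsWith().
      if (attrs[j].value.indexOf(prefix) === 0) {
        result.push(attrs[j].value);
      }
    }
  }
  return result;
}

// In a browser:
// collectAttributeValues(document, "http://example.com/api/v3?");
```

The plain loops also avoid building one intermediate array per element, which may matter on a page with a very large number of elements.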

Is there really no regexp that can work with `.match` to be able to return the array I need directly? That is, return all strings starting with `http://example.com/api/v3?` and ending with a space. — user2284570, Oct 02 '16 at 23:17
@user2284570 I wouldn’t recommend regexes but I’ve included them in the answer. You should also take the [tip about workers by Redu](http://stackoverflow.com/questions/39822557/how-to-return-all-attributes-of-a-web-page-that-starts-by-a-specific-value-using/39822670#comment66935560_39822557) into consideration. — Sebastian Simon, Oct 02 '16 at 23:26
