
Say we have the following URLs:

1. http://example.com#hash0
2. http://example.com#hash0#hash1
3. http://example.com#hash0/sample.net/
4. http://example.com#hash0/sample.net/#hash1
5. http://example.com#hash0/image.jpg
6. http://example.com#hash0/image.jpg#hash1
7. something.php#?type=abc&id=123
8. something.php#?type=abc&id=123#hash0
9. something.php/?type=abc&id=#123
....................................

and more permutations of this kind; you get the point. How can I selectively remove the "irrelevant" hashes from URLs like these without affecting their functionality (so that they remain complete links or images)?

For example, from number 1 in this list I would like #hash0 to be removed; from 2, both #hash0 and #hash1; from 3 I'd like to keep it, since it's followed by a continuation of the path (yes, it's possible, check here); from 4, remove only #hash1; from 5, keep it; from 6, remove just #hash1; ...; and from 9 I think keep it, since it might be relevant to the query (not sure about it though), and so on. Basically, I'd like to remove only the hashes that have nothing usable after them (paths, queries, image files, etc.): "irrelevant" hashes like #top, #bottom and such, which refer to the current page.

I'm working on something that also involves getting absolute URLs from relative ones (with the help of either a new anchor's href or a new URL object's href), so a solution (like here) that can "blend in" with the location object's properties (like .protocol, .host, .pathname, .search, .hash, etc.) is preferable, since a built-in might be more "trustworthy", but a good (and shorter) regex would be acceptable as well. All in all, shorter solutions are preferable, as I don't want my project to do unnecessary extra work for every link or image link that it encounters while it parses the entire current URL.
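
To make the "built-in" route concrete, here is a minimal sketch, assuming that a fragment counts as "irrelevant" when it contains no `/`, `?` or `.`. The function name `stripTrailingHashes` and the base URL in the usage lines are hypothetical, chosen only for illustration:

```javascript
// Hypothetical helper: resolve a (possibly relative) link with the built-in
// URL object, then trim trailing "simple" fragments from its .hash property.
function stripTrailingHashes(link, base) {
    const url = new URL(link, base); // base is only needed for relative links
    // Assumption: one or more trailing "#..." pieces without "/", "?" or "."
    // are "irrelevant" and can be dropped.
    url.hash = url.hash.replace(/(?:#[^#\/?.]*)+$/, '');
    return url.href;
}

console.log(stripTrailingHashes('http://example.com#hash0#hash1'));
console.log(stripTrailingHashes('http://example.com#hash0/image.jpg#hash1'));
console.log(stripTrailingHashes('something.php#?type=abc&id=123#hash0', 'http://example.com/dir/'));
```

Note that the URL object normalizes what it parses, so `http://example.com` comes back as `http://example.com/`; that may or may not be acceptable depending on how the results are compared later.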

Yin Cognyto
  • How do you know which hashes are "irrelevant"? Do they change? I would start by looking at `String.replace()` – mhodges Aug 25 '17 at 17:18
  • @mhodges Hashes that are not followed by valid path / query / image link sections are "irrelevant" in my case. For example, the hashes from 3, 5 or 7 (maybe even 9) or the first hash from 4, 6 or 8 are relevant, since they are followed by "other than hashes" URL parts. – Yin Cognyto Aug 25 '17 at 17:37

1 Answer


Maybe this is what you want, using a regular expression.

var urls = [
        'http://example.com#hash0',                   // remove
        'http://example.com#hash0#hash1',             // remove
        'http://example.com#hash0/sample.net/',       // keep
        'http://example.com#hash0/sample.net/#hash1', // remove #hash1
        'http://example.com#hash0/image.jpg',         // keep
        'http://example.com#hash0/image.jpg#hash1',   // remove #hash1
        'something.php#?type=abc&id=123',             // keep
        'something.php#?type=abc&id=123#hash0',       // remove #hash0
        'something.php/?type=abc&id=#123',            // remove #123
    ],
    result = urls.map(h => h.replace(/(?:#[^#\/\?\.]*)*#[^#\/\?\.]*$/gi, ''));
    
console.log(result);
Nina Scholz
  • Yeah, that's something that crossed my mind too, but it needs to include potential valid queries or images too... – Yin Cognyto Aug 25 '17 at 17:39
  • The problem is, unless you specify a whitelist or a blacklist, any solution is just an attempt. – Nina Scholz Aug 25 '17 at 17:55
  • I know - that's why I asked this question here. If I did have a bullet-proof solution, I wouldn't have asked ;) However, I explained in my question what a general "whitelist" would look like: valid path, valid query, valid image, etc. Basically anything that's not a "simple hash". If I could treat the available hashes as URLs in themselves, I could recursively analyze each hash for other parts that are relevant and exclude the hashes that don't contain such parts (e.g. _h=location.hash; if ((h.path==="") && (h.search==="") && (!h.endsWith(".jpg"))) {h="";}_) or such. – Yin Cognyto Aug 25 '17 at 18:14
  • Ok, please replace the searched regex with `/(?:#[^#\/\?\.]*)*#[^#\/\?\.]*$/gi` in your answer, so that I can accept it. The flags aren't required, but I like to have them in my regexes just in case, or for testing purposes. The rest of the regex matches one or more consecutive URL fragments at the end of the main URL that contain no `/`, `?` or `.`. I am aware that this lets a lone `/`, `?` or `.` through as a hash, but it seems the closest to allowing only URL fragments that look like paths, queries or webpages/images. See [link](http://regexr.com/3gk87) for testing. – Yin Cognyto Aug 25 '17 at 21:04
  • The result here looks different than at the regex site. – Nina Scholz Aug 25 '17 at 21:11
  • Yes, I know. I actually didn't type the query part correctly in my question for the 7th and 8th URLs, I missed the critical `?` character - that's the explanation. I figured out that what I initially wrote were not actual queries, but more like other sorts of parameters, which I didn't care about... I edited my question to correct that, and you might add the missing `?` from those URLs in your answer as well, to make it right. I'll accept your answer anyway, but the correction(s) might be appropriate, in order for the code to behave as desired. – Yin Cognyto Aug 25 '17 at 21:28
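
The whitelist idea from the comments (treat each hash as a small URL part and keep it only if it looks like a path, a query or an image) could be sketched as below. This is only one possible reading: the names `IMAGE_EXT`, `isRelevant` and `stripIrrelevantHashes`, the list of image extensions, and the exact tests for "path" and "query" are all assumptions, not something stated in the thread.

```javascript
// Assumed whitelist of image extensions; adjust to the project's needs.
const IMAGE_EXT = /\.(?:jpe?g|png|gif|webp|svg)$/i;

// Hypothetical predicate: a fragment is "relevant" if it looks like
// a path, a query or an image file.
function isRelevant(fragment) {
    const looksLikePath = fragment.includes('/') && fragment.length > 1; // a lone "/" does not count
    const looksLikeQuery = /^\?[^=]+=/.test(fragment);                   // "?key=" at the start
    const looksLikeImage = IMAGE_EXT.test(fragment);
    return looksLikePath || looksLikeQuery || looksLikeImage;
}

// Drop fragments from the end of the URL until a relevant one is found.
function stripIrrelevantHashes(url) {
    const [base, ...fragments] = url.split('#');
    while (fragments.length && !isRelevant(fragments[fragments.length - 1])) {
        fragments.pop();
    }
    return [base, ...fragments].join('#');
}

console.log(stripIrrelevantHashes('http://example.com#hash0/sample.net/#hash1'));
console.log(stripIrrelevantHashes('something.php#?type=abc&id=123#hash0'));
```

Compared with the single regex, this version rejects a lone `/`, `?` or `.` as a hash, at the cost of a few more lines and an extension list that has to be maintained.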