How can you determine whether a string is a local path or a reference to another server?

For example, given the following list of URLs, how can I determine which ones refer to a local file path and which refer to example.com?

var paths = [
    'foo/bar.css',
    '/bar/foo.css',
    '//cdn.example.com/monkey.css',
    'https://example.com/banana.css'
];

I've looked into the `url` npm package; however, it cannot parse the third path into an object containing the host.

What I'm trying to do is extract all of the CSS information from the page, as well as from linked stylesheets, using a utility that knows only the page URL. I need to know where to send subsequent requests: either to the original host or to some other one, such as example.com.

forrestmid
  • The third one is invalid! – ibrahim mahrir Feb 19 '17 at 22:50
  • Just check if there is a `:` in the path! – ibrahim mahrir Feb 19 '17 at 22:53
  • It's a file reference for a CSS file in an href attribute. It pulls the style sheet fine, and since I don't have control over the href attributes I need the URL regardless. – forrestmid Feb 19 '17 at 22:53
  • The third one doesn't have a colon. – forrestmid Feb 19 '17 at 22:53
  • It should, or it will be regarded as a local resource! – ibrahim mahrir Feb 19 '17 at 22:55
  • @ibrahimmahrir Check [this](http://stackoverflow.com/questions/9646407/two-forward-slashes-in-a-url-src-href-attribute) – forrestmid Feb 19 '17 at 23:05
  • @ibrahimmahrir, the third one is a protocol-relative URL: https://www.paulirish.com/2010/the-protocol-relative-url/ – ElChiniNet Feb 19 '17 at 23:09
  • @ibrahimmahrir, the third one is definitely valid, per section 4.2 of RFC 3986. It means "keep the scheme from the base URI, while both the host and path are taken from the relative URI". – jcaron Feb 19 '17 at 23:10
  • @ibrahimmahrir, also, the presence of a `:`, even if you ignore the third example, does not mean that the file is on a different server, as one could use an absolute URI with the same host as the base URI. – jcaron Feb 19 '17 at 23:11
  • @forrestmid, are you working on `node` or in a browser? The native `url` package in node correctly resolves the third URL. – jcaron Feb 19 '17 at 23:15
  • @jcaron I'm working in node, and the `url` package doesn't resolve the hostname for the third URL for me using [this code](https://runkit.com/58a5d69a22e10200148b2666/58aa27600ef2940014a48776). – forrestmid Feb 19 '17 at 23:19
  • Obviously, how could it? You need to use `resolve` and provide the base URL before you parse it for a domain. – jcaron Feb 19 '17 at 23:27
  • @jcaron I'm not sure I'm understanding. `resolve` is used to create final URLs from a host and a path. [This](https://runkit.com/58a5d69a22e10200148b2666/58aa27600ef2940014a48776) obviously doesn't work. – forrestmid Feb 19 '17 at 23:30
  • No, `resolve` is used to create a full absolute URL from a base URL (the URL of the document where you found the target URLs) and target URLs (also called URI references), which may be absolute (in which case the base URL is ignored) or relative (in which case the scheme, the scheme and host, or the scheme, host and path of the base URL are combined with the target URL to get the full absolute URL). – jcaron Feb 19 '17 at 23:33

3 Answers

One simple way is to check whether the string starts with http:// or https://. You can add any additional search strings as you see fit (ftp://, etc.).

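var paths = ['foo/bar.css',
             '/bar/foo.css',
             '//cdn.example.com/monkey.css',
             'https://example.com/banana.css'];

paths.forEach(function (url) {
  // Absolute URLs start with an explicit scheme.
  // Note: they are reported as external even if they point at the page's own host.
  var hasScheme = url.indexOf("http:") === 0 || url.indexOf("https:") === 0;
  // Protocol-relative URLs start with "//" and inherit the page's scheme.
  // window.location only exists in a browser, not in Node.
  var isProtocolRelative = url.indexOf("//") === 0 &&
      (window.location.protocol === "http:" || window.location.protocol === "https:");
  var yesNo = (hasScheme || isProtocolRelative) ? "" : " NOT";
  console.log(url + " is" + yesNo + " an external path");
});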
Scott Marcus

To resolve a URI reference, you need to provide a base URL. Only then can you fully interpret such a URL.

In your example, this would be:

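var url = require("url");
var baseUrl = 'http://www.google.com/thing'; // insert your actual base URL here
var paths = [
    'foo/bar.css',
    '/bar/foo.css',
    '//cdn.example.com/monkey.css',
    'https://example.com/banana.css'
];
paths.forEach((p)=>{
    // Resolve each reference against the base URL first, then parse the
    // resulting absolute URL to read its hostname.
    // With the base above, the first two resolve to www.google.com.
    console.log(url.parse(url.resolve(baseUrl,p)).hostname);
});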
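
A possible variation, assuming a Node version where the WHATWG `URL` class is available as a global: `new URL(reference, base)` performs the same resolution in one step, and comparing hosts then tells you whether a request goes back to the original server. This is only a sketch; `pageUrl` and `isSameHost` are hypothetical names used for illustration, and the base URL is a placeholder.

```javascript
// Placeholder base URL (assumption): the URL of the page that contained the links.
const pageUrl = 'http://www.google.com/thing';

// Hypothetical helper: true when the reference resolves to the same host as the base.
function isSameHost(reference, base) {
    // new URL(reference, base) handles relative, root-relative,
    // protocol-relative and absolute references alike.
    const resolved = new URL(reference, base);
    return resolved.host === new URL(base).host;
}

const paths = [
    'foo/bar.css',
    '/bar/foo.css',
    '//cdn.example.com/monkey.css',
    'https://example.com/banana.css'
];

paths.forEach((p) => {
    const resolved = new URL(p, pageUrl);
    const where = isSameHost(p, pageUrl) ? 'original host' : 'other host';
    console.log(resolved.href + ' -> ' + where);
});
```

Comparing `host` treats a different port as a different server; compare `hostname` instead if the port should be ignored.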
jcaron
  • Ahh. You're my hero. Exactly what I was looking for. I didn't realize that `url.resolve` could do that. The docs provided for it don't show other URLs. – forrestmid Feb 19 '17 at 23:34

I would probably loop over the array and run each string through a set of if statements. Something like this (a rough sketch):

function checkPath(string) {
    // Absolute URL with its own scheme and host
    if (/^https?:\/\//.test(string)) return "absolute";
    // Same scheme as the page, but an explicit host
    if (string.startsWith("//")) return "protocol-relative";
    // Path from the root of the current host
    if (string.startsWith("/")) return "root-relative";
    // Anything else is relative to the current document
    return "relative";
}

Those ifs should be enough to sort the strings into separate cases that you can then handle individually. Don't use libraries when it's so easy to write the function from scratch yourself.

Simon Hyll