Assuming a single subdomain, how do I replace everything in the URL before the domain and any trailing slashes?

Example strings:
https://www.google.com/
http://net.tutsplus.com/about

The result I want (from my example strings) is:
google.com
tutsplus.com/about

Currently, the regex I'm using is:
^https?:\/\/

Which results in:
www.google.com/
net.tutsplus.com/about

This strips everything up to and including the slashes, but I want to strip everything up to and including the first dot (and any trailing slash).

My current code in Apps Script is:

var body = DocumentApp.getActiveDocument().getBody();
body.replaceText('^https?:\/\/', '');

Given that I'm using Google Apps Script, it could be an issue with how replaceText() works. Thanks in advance for the help.

SwankyLegg
  • I would be surprised if there is not a JavaScript library for doing this. Have you looked into this? – Tim Biegeleisen Feb 05 '16 at 01:44
  • Try `^https?:\/\/.*?\.` to match everything up to and including the first `.`. – sideroxylon Feb 05 '16 at 01:48
  • @sideroxylon That results in: `ww.google.com/` – SwankyLegg Feb 05 '16 at 01:53
  • @CBroe that's not at all constructive, and there's no reason for the hostility. I didn't include an exhaustive list of what I've tried for fear of cluttering the question. I've tried `^https?:\/\/.*\.$` and a number of variations. – SwankyLegg Feb 05 '16 at 01:56
  • @TimBiegeleisen A plain regex should be able to get me there. I don't want to import a library into Google Apps Script, partly because it's clunky and partly because it *shouldn't* be necessary. – SwankyLegg Feb 05 '16 at 01:58
  • You’ll want to match one or more characters that aren’t a dot, and then a dot. Show us something that can be considered a halfway serious attempt at doing that, then we can see where it goes from there. – CBroe Feb 05 '16 at 02:08
  • @CBroe the match in a normal regex would be `^https?:\/\/\w+\.` to do the first bit. You're incorrect about ruling out the implementation of `replaceText()`. As per the Apps Script docs: "A subset of the JavaScript regular expression features are not fully supported, such as capture groups and mode modifiers." Using `\w+` will literally match the character `w` and be greedy, resulting in `google.com/` and `http://net.tutsplus.com/about`. – SwankyLegg Feb 05 '16 at 02:19
  • _“Using `\w+` will literally match the character `w`”_ – that makes no sense whatsoever. If that was really the syntax to match a specific character, then I don’t see how you could match `h`, `t`, `p` and `s` upfront without using said syntax there as well. – CBroe Feb 05 '16 at 02:24
  • @CBroe `\w+` should match "any alphanumeric character including the underscore" as per the MDN docs. Using `^https?:\/\/\w+` returns `.google.com/` and `http://net.tutsplus.com/about` using the code I've provided verbatim. You seem to have a high opinion of your own ability to solve this, but if it were really that easy for you, you could've already done so, answered my question, and provided an explanation. – SwankyLegg Feb 05 '16 at 02:33
  • Can we assume that `body` contains both of your example strings at the same time, and you are running this `replaceText` only once? Then I’d be very interested to hear what the result is when you switch the order of those two example strings around in your body text, so that `http://net.tutsplus.com/about` comes first, and `https://www.google.com/` at a later position. – CBroe Feb 05 '16 at 02:39
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/102652/discussion-between-swankylegg-and-cbroe). – SwankyLegg Feb 05 '16 at 02:42
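
The outputs reported in the comments above are consistent with the backslashes being consumed by the JavaScript string literal before the pattern ever reaches the regex engine: inside single quotes, `'\w'` is just `'w'` and `'\.'` is just `'.'`. The sketch below reproduces this in plain JavaScript. It assumes `new RegExp(patternString)` is a fair stand-in for how `replaceText()` treats its string argument for these simple patterns; `urls` and `stripWith` are illustrative names, not part of Apps Script.

```javascript
// Stand-in for a string-based regex API: the pattern arrives as a plain string.
const urls = ['https://www.google.com/', 'http://net.tutsplus.com/about'];

function stripWith(patternString) {
  const re = new RegExp(patternString);
  return urls.map(function (u) { return u.replace(re, ''); });
}

// Single backslashes: the literal becomes "^https?://w+." (a run of "w", then any character).
const single = stripWith('^https?:\/\/\w+\.');
// → ['google.com/', 'http://net.tutsplus.com/about']

// The lazy suggestion becomes "^https?://.*?." and eats exactly one character.
const lazy = stripWith('^https?:\/\/.*?\.');
// → ['ww.google.com/', 'et.tutsplus.com/about']

// Doubled backslashes: the regex engine really sees \w+ and \.
const doubled = stripWith('^https?://\\w+\\.');
// → ['google.com/', 'tutsplus.com/about']

console.log(single, lazy, doubled);
```

If the assumption holds, doubling the backslashes in the `replaceText()` pattern string should be enough to get the intended behaviour.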

2 Answers

It looks like Google Docs' regex implementation is weak. It doesn't support capturing groups, so you will run into problems with URLs like the following:

  • http://hoffmaninstitute.co.uk
  • http://google.com
  • http://docs.aws.amazon.com/

Assuming that the text is always http:// + one_sub_domain + domain + tld, you can use:

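  var body = DocumentApp.getActiveDocument().getBody();
  // The pattern is a string, so the backslash is doubled to pass a literal "\." to the regex engine.
  body.replaceText('^https?://[0-9A-Za-z_]+\\.', '');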
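
To see where the one-subdomain assumption breaks down, the same pattern can be run in plain JavaScript against the URLs listed above. This is only a sketch: `new RegExp` stands in for `replaceText()`, the names `firstLabel`, `samples` and `stripped` are illustrative, and the dot is written as `\\.` so that the string literal delivers an escaped dot to the regex engine.

```javascript
// Pattern: scheme, then one host label made of [0-9A-Za-z_], then a literal dot.
const firstLabel = new RegExp('^https?://[0-9A-Za-z_]+\\.');

const samples = [
  'http://net.tutsplus.com/about',   // one subdomain: behaves as intended
  'http://google.com',               // no subdomain: the domain itself is stripped
  'http://hoffmaninstitute.co.uk',   // two-part TLD, no subdomain
  'http://docs.aws.amazon.com/'      // two subdomains: only the first is stripped
];

const stripped = samples.map(function (s) { return s.replace(firstLabel, ''); });
console.log(stripped);
// → ['tutsplus.com/about', 'com', 'co.uk', 'aws.amazon.com/']
```

Only the first sample matches the assumed shape; the others show why the pattern should not be applied to arbitrary URLs.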
daniel

From Apps Script's .replaceText() docs:

Replaces all occurrences of a given text pattern with a given replacement string, using regular expressions.
A subset of the JavaScript regular expression features are not fully supported, such as capture groups and mode modifiers.

It accepts only strings as arguments. Implementing my own regex search-and-replace would be unnecessarily complex, because each element would first have to be converted to the appropriate Apps Script object type before a replacement could be made.

I failed to note that a subdomain should be removed only if it is www, because some link formats I hadn't anticipated need their subdomain to stay readable. For reference, here's a more thorough set of link formats:

https://www.google.com/
https://www.google.com
https://google.com/
https://google.com
http://www.google.com/
http://www.google.com
http://google.com
https://product.google.com/about/
https://product.google.com/about
https://product.google.com/
https://product.google.com
http://product.google.com/about/
http://product.google.com/about
http://product.google.com/
http://product.google.com

While the following is inefficient and verbose, it works:

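function replaceLongUrls(element) {
    element = element || DocumentApp.getActiveDocument().getBody();

    // Patterns are passed as strings, so a regex backslash must be doubled.
    element.replaceText('^https?://', '');  // strip the scheme
    element.replaceText('^www\\.', '');     // strip a leading "www." (dot escaped)
    element.replaceText('/$', '');          // strip one trailing slash
}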
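
The three steps above can be checked outside Apps Script with ordinary strings. The sketch below is not the Apps Script API: `shortenUrl` is a hypothetical helper, `new RegExp` stands in for `replaceText()`, and `wwwxyz.example.com` is a made-up host used only to show what an unescaped dot in `'^www.'` would do.

```javascript
// Apply the same three string patterns, in the same order, to a single URL.
function shortenUrl(url) {
  return url
    .replace(new RegExp('^https?://'), '')  // strip the scheme
    .replace(new RegExp('^www\\.'), '')     // strip a leading "www." only
    .replace(new RegExp('/$'), '');         // strip one trailing slash
}

console.log(shortenUrl('https://www.google.com/'));          // → google.com
console.log(shortenUrl('http://google.com'));                // → google.com
console.log(shortenUrl('https://product.google.com/about/')); // → product.google.com/about

// With the dot left unescaped, "www" plus ANY next character is removed.
const unescaped = 'wwwxyz.example.com'.replace(new RegExp('^www.'), '');
console.log(unescaped); // → yz.example.com
```

Escaping the dot keeps hosts that merely start with "www" intact, which matters once subdomains other than www are meant to survive.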

Sources:
Apps Script Documentation
Google Apps Script Regex exec() returning null
replaceText() RegEx "not followed by"

SwankyLegg