Replace everything before domain javascript (Google Apps Script) regex

Question

Assuming a single subdomain, how do I replace everything in the URL before the domain and any trailing slashes?

Example strings:
https://www.google.com/
http://net.tutsplus.com/about

The result I want (from my example strings) is:
google.com
tutsplus.com/about

Currently, the regex I'm using is:
^https?:\/\/

Which results in:
www.google.com/
net.tutsplus.com/about

This replaces everything up to and including the slashes in the URL, but I want to replace everything up to and including the first dot (.).

My current code in Apps Script is:

var body = DocumentApp.getActiveDocument().getBody();
body.replaceText('^https?:\/\/', '');

Given that I'm using Google Apps Script, it could be an issue with how replaceText() works. Thanks in advance for the help.

I would be surprised if there is not a JavaScript library for doing this. Have you looked into this? — Tim Biegeleisen, Feb 05 '16 at 01:44
Try `^https?:\/\/.*?\.` to match everything up to and including the first `.`. — sideroxylon, Feb 05 '16 at 01:48
@CBroe that's not at all constructive, and there's no reason for the hostility. I didn't include an exhaustive list of what I've tried for fear of cluttering the question. I've tried `^https?:\/\/.*\.$` and a number of variations. — SwankyLegg, Feb 05 '16 at 01:56
@TimBiegeleisen A plain regex should be able to get me there. I don't want to import a library into Google Apps Script, partly because it's clunky and partly because it *shouldn't* be necessary. — SwankyLegg, Feb 05 '16 at 01:58
You’ll want to match one or more characters that aren’t a dot, and then a dot. Show us something that can be considered a halfway serious attempt at doing that, then we can see where it goes from there. — CBroe, Feb 05 '16 at 02:08
@CBroe the match in a normal regex would be `^https?:\/\/\w+\.` to do the first bit. You're incorrect about ruling out the implementation of `replaceText()`. As per the Apps Script docs: "A subset of the JavaScript regular expression features are not fully supported, such as capture groups and mode modifiers." Using `\w+` will literally match the character `w` and be greedy, resulting in `google.com/` and `http://net.tutsplus.com/about`. — SwankyLegg, Feb 05 '16 at 02:19
_“Using `\w+` will literally match the character `w`”_ – that makes no sense whatsoever. If that was really the syntax to match a specific character, then I don’t see how you could match `h`, `t`, `p` and `s` upfront without using said syntax there as well. — CBroe, Feb 05 '16 at 02:24
@CBroe `\w+` should match "any alphanumeric character including the underscore" as per the MDN docs. Using `^https?:\/\/\w+` returns `.google.com/` and `http://net.tutsplus.com/about` using the code I've provided verbatim. You seem to have a high opinion of your own ability to solve this, but if it were really that easy for you, you could've already done so, answered my question, and provided an explanation. — SwankyLegg, Feb 05 '16 at 02:33
Can we assume that `body` contains both of your example strings at the same time, and you are running this `replaceText` only once? Then I’d be very interested to hear what the result is when you switch the order of those two example strings around in your body text, so that `http://net.tutsplus.com/about` comes first, and `https://www.google.com/` at a later position. — CBroe, Feb 05 '16 at 02:39
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/102652/discussion-between-swankylegg-and-cbroe). — SwankyLegg, Feb 05 '16 at 02:42

score 1 · Answer 1 · answered Feb 05 '16 at 16:56

It looks like Google Docs' regex implementation is limited. It doesn't support capture groups, so you will run into problems with URLs such as the following:

http://hoffmaninstitute.co.uk
http://google.com
http://docs.aws.amazon.com/

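Assuming the text always has the form http:// + one subdomain + domain + TLD, you can use:

  var body = DocumentApp.getActiveDocument().getBody();
  // The pattern is passed as a string, so the backslash must be doubled to reach the regex engine.
  body.replaceText('^https?://[0-9A-Za-z_]+\\.', '');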
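
Because replaceText() takes its pattern as a string, a backslash inside the string literal matters. The following is a minimal sketch in plain JavaScript, not Apps Script itself: `RegExp` stands in for the replaceText() engine (an assumption made only so the snippet runs anywhere), and `stripPrefix` is a hypothetical helper name.

```javascript
// '\.' inside a string literal collapses to '.', so this pattern ends in "any character".
var singleBackslash = '^https?://[0-9A-Za-z_]+\.';

// '\\.' leaves one backslash in the string, so this pattern ends in a literal dot.
var doubleBackslash = '^https?://[0-9A-Za-z_]+\\.';

// Hypothetical helper: RegExp is used here as a stand-in for the replaceText() engine.
function stripPrefix(pattern, text) {
  return text.replace(new RegExp(pattern), '');
}

// A host with no dot shows the difference between the two patterns.
var withSingle = stripPrefix(singleBackslash, 'http://localhost/about');
var withDouble = stripPrefix(doubleBackslash, 'http://localhost/about');
```

With a dotted host such as `http://net.tutsplus.com/about` both patterns strip the same prefix, which is why the missing backslash is easy to overlook. It only shows up when the character after the first label is not a dot.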

daniel

thanks. This doesn't actually work for the links in the example strings. Thanks for trying, though. – SwankyLegg Feb 05 '16 at 17:38

score 0 · Accepted Answer · edited May 23 '17 at 11:52

From Apps Script's .replaceText() docs:

Replaces all occurrences of a given text pattern with a given replacement string, using regular expressions.
A subset of the JavaScript regular expression features are not fully supported, such as capture groups and mode modifiers.

replaceText() only accepts strings as arguments. Implementing my own regex search and replace is unnecessarily complex, because each element would first have to be converted to the appropriate Apps Script object type before a replacement could be issued.

I failed to note that a subdomain should only be removed when it is www, because some link formats I hadn't foreseen need their subdomain to stay readable. For reference, here's a more thorough set of link formats:

https://www.google.com/
https://www.google.com
https://google.com/
https://google.com
http://www.google.com/
http://www.google.com
http://google.com
https://product.google.com/about/
https://product.google.com/about
https://product.google.com/
https://product.google.com
http://product.google.com/about/
http://product.google.com/about
http://product.google.com/
http://product.google.com

While the following is inefficient and verbose, it works:

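function replaceLongUrls(element) {
    // Default to the active document's body when no element is passed in.
    element = element || DocumentApp.getActiveDocument().getBody();

    // Strip the scheme, then a leading "www.", then a trailing slash.
    element.replaceText('^https?://', '');
    // The pattern is a string, so the backslash is doubled to match a literal dot.
    element.replaceText('^www\\.', '');
    element.replaceText('/$', '');
}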
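
The same three passes can be tried outside a document in plain JavaScript. This is only a sketch: it assumes `String.prototype.replace` behaves like replaceText() for these simple patterns, and `shortenUrl` is a hypothetical name, not part of Apps Script.

```javascript
// Hypothetical helper mirroring the three passes: scheme, leading "www.", trailing slash.
function shortenUrl(url) {
  return url
    .replace(/^https?:\/\//, '') // drop http:// or https://
    .replace(/^www\./, '')       // drop a leading "www." only, other subdomains stay
    .replace(/\/$/, '');         // drop a single trailing slash
}

// A few of the link formats listed above.
var samples = [
  'https://www.google.com/',
  'http://google.com',
  'https://product.google.com/about/'
];
var shortened = samples.map(shortenUrl);
```

Here `shortened` holds `google.com`, `google.com` and `product.google.com/about`, so only the www subdomain is removed.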

Sources:
Apps Script Documentation
Google Apps Script Regex exec() returning null
replaceText() RegEx "not followed by"
