Get all links from html page using regex

Question

I'm using Google Apps Script to fetch the content of emails from Gmail, and then I need to extract all of the links from the HTML tags. I found some code here on Stack Overflow and implemented it with a regular expression, but it always returns only the first URL (http://vacante2016.eu/tr/17599/51743713/c4f5eadf38eb475d39e3cdeca9201538).

Is there a way to write a loop that searches for the next match of the regular expression on each iteration, so that all of the URLs are returned one by one?

Here you can see an example with the content of an email that I need to get those links from: https://www.mailinator.com/inbox2.jsp?public_to=get_urls#/#public_showmaildiv

This is my code:

function getURL() {

  var threads = GmailApp.getInboxThreads();
  var message = threads[0].getMessages()[0];
  var content = message.getRawContent();

    var source = (content || '').toString();
    var urlArray = [];
    var url;
    var matchArray;

    // Regular expression to find FTP, HTTP(S) URLs.
    var regexToken = /(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/;

    // Iterate through any URLs in the text.
    while( (matchArray = regexToken.exec( source )) !== null )
    {
      var token = matchArray[0];
      urlArray.push( token );
    }
}

UPDATE: Changing the regex to /(?:ht|f)tps?\:\/\/[a-zA-Z0-9\-.]+\.[a-zA-Z]{2,3}(\/[\S=]*)?/g improved things, but now I also get results like the following when I search for URLs: "http://vacante2016.eu/clk/17599/5=\r\n1743713/150132/bf7639dd7e7aa48c9197a52a8c61e168\"><img" ... I think the regex also needs a condition so that the URL is matched only up to the > symbol.

Also, is there a way to remove the additional characters like =, \r and \n from the found url?

Looks like you forgot `/g`: `var regexToken = /(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/g;`. See http://stackoverflow.com/questions/520611/how-can-i-match-multiple-occurrences-with-a-regex-in-javascript-similar-to-phps — Wiktor Stribiżew, Aug 08 '16 at 13:02
If the email is formatted with html, is there a reason as to why you're not just getting the attributes straight from the tags? — NTL, Aug 08 '16 at 13:05
@NTL no, there is no reason, but I don't know how to do this...I think that the regex must search for the `href` property from `` and `` tags — Valip, Aug 08 '16 at 13:08
@WiktorStribiżew that fixed it, but now a url response that looks like this : `http://vacante2016.eu/tr/17599/51743713/c4f5eadf38eb475d39e3cde=ca9201538` will be truncated after `=` as follows: `http://vacante2016.eu/tr/17599/51743713/c4f5eadf38eb475d39e3cde` .. why does this happen? — Valip, Aug 08 '16 at 13:11
Well, the `/(?:ht|f)tps?\:\/\/[a-zA-Z0-9\-.]+\.[a-zA-Z]{2,3}(\/\S*)?/g` [should work](https://regex101.com/r/kI2yK2/1). Check what you are doing to the links or whether you check against expected contents. — Wiktor Stribiżew, Aug 08 '16 at 13:16
Still the same, this is how the fetch returns the urls: `href=3D"http://vacante2016.eu/tr/17599/51743713/c4f5eadf38eb475d39e3cde=ca9201538` and the above regex truncates the url to `http://vacante2016.eu/tr/17599/51743713/c4f5eadf38eb475d39e3cde` — Valip, Aug 08 '16 at 13:22
Are you sure there are `=`s in the input? `\S` matches *any non-whitespace symbol*. I doubt you need `/(?:ht|f)tps?\:\/\/[a-zA-Z0-9\-.]+\.[a-zA-Z]{2,3}(\/[\S=]*)?/g`, that would be too weird. — Wiktor Stribiżew, Aug 08 '16 at 14:26
Please let me know if I should post my suggestion once you figure out the issue with the `=`. In case you need more help, just update the question with your input data so that we could repro the issue on our side. — Wiktor Stribiżew, Aug 08 '16 at 17:08
@WiktorStribiżew thank you a lot for helping me, I updated my question — Valip, Aug 08 '16 at 17:23

score 3 · Accepted Answer · answered Aug 08 '16 at 17:26

You need to use a global modifier /g to get multiple matches with RegExp#exec.

Besides, since your input is HTML code, you need to make sure you do not grab " or < with \S:

/(?:ht|f)tps?:\/\/[-a-zA-Z0-9.]+\.[a-zA-Z]{2,3}(\/[^"<]*)?/g

See the regex demo.

If for some reason this pattern does not match equal signs, add = as an explicit alternative:

/(?:ht|f)tps?:\/\/[-a-zA-Z0-9.]+\.[a-zA-Z]{2,3}(?:\/(?:[^"<=]|=)*)?/g

See another demo (however, the first one should do).
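As a minimal sketch of the loop from the question combined with the first pattern above: with the g flag, each exec() call resumes from the regex's lastIndex, so the loop advances through the input instead of returning the first match forever. The helper name extractUrls and the sample string are made up for illustration.

```javascript
// Hypothetical helper: collects every URL matched by the first pattern above.
function extractUrls(source) {
  // The g flag makes exec() continue from regex.lastIndex on each call.
  // Without it, exec() always restarts at index 0 and the loop never ends.
  var regex = /(?:ht|f)tps?:\/\/[-a-zA-Z0-9.]+\.[a-zA-Z]{2,3}(\/[^"<]*)?/g;
  var urls = [];
  var match;
  while ((match = regex.exec(source)) !== null) {
    urls.push(match[0]); // match[0] is the whole matched URL
  }
  return urls;
}

// Made-up sample input, not the email from the question.
var sample = '<a href="http://example.com/a/1">one</a> ' +
             '<a href="https://example.org/b?x=2">two</a>';
var found = extractUrls(sample);
// found: ["http://example.com/a/1", "https://example.org/b?x=2"]
```

In the question's getURL function, the same loop would run on source, and the function would also need to return urlArray for the caller to see the results.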

Wiktor Stribiżew

The second pattern works perfectly! Last question: is there a way to remove the additional characters like `=`, `\r` and `\n` from the found URL, so that `"http://vacante2016.eu/clk/17599/5=\r\n1743713/150132/bf7639dd7e7aa48c9197a52a8c61e168\"` becomes `"http://vacante2016.eu/clk/17599/51743713/150132/bf7639dd7e7aa48c9197a52a8c61e168\"`? – Valip Aug 08 '16 at 17:38
I don't know if these are literal strings. If yes, you will have to use something like `.replace(/\\[rn]|=/g, '')`. – Wiktor Stribiżew Aug 08 '16 at 17:42
They are string literals. I use `token.replace(/\\[rn]|=/g, '')` and nothing happens. To be sure, I also called token.toString() before using replace. – Valip Aug 08 '16 at 17:51
Then try `.replace(/[\r\n=]+/g, "")` – Wiktor Stribiżew Aug 08 '16 at 17:57
This works only partially: only the `=` is removed. I also tried `.replace("\r", "")` and it does nothing... – Valip Aug 08 '16 at 18:01
If you have newlines, carriage returns or equal signs, the above solution must work in JavaScript code, I will double check in Google Apps Script when kids go to bed. – Wiktor Stribiżew Aug 08 '16 at 18:40
Finally fixed it with `.replace(/(=\r\n|\n|\r)/gm, '')` – Valip Aug 08 '16 at 21:31
Ok, so that is done. I think you may shorten it to `.replace(/=\r\n|\n|\r/g, '')` – Wiktor Stribiżew Aug 08 '16 at 22:37
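A minimal sketch of the cleanup step that the comments above converge on. The `=3D` and `=\r\n` sequences in the raw content look like quoted-printable encoding, where `=` followed by a line break is a soft line wrap, so the sketch removes that pair as a unit plus any stray CR or LF. The earlier `/\\[rn]|=/g` attempt matches a literal backslash followed by r or n, not an actual carriage return or newline. The helper name cleanUrl is made up for illustration.

```javascript
// Hypothetical helper: removes soft line breaks ("=" followed by CRLF) and
// any remaining CR or LF characters from a matched URL.
function cleanUrl(url) {
  return url.replace(/=\r\n|\n|\r/g, '');
}

// Here \r and \n are real control characters, as in the raw email content.
var raw = 'http://vacante2016.eu/clk/17599/5=\r\n1743713/150132/' +
          'bf7639dd7e7aa48c9197a52a8c61e168';
var cleaned = cleanUrl(raw);
// cleaned: "http://vacante2016.eu/clk/17599/51743713/150132/bf7639dd7e7aa48c9197a52a8c61e168"
```

Unlike `/[\r\n=]+/g`, this pattern leaves an `=` that is not followed by a line break untouched, so query strings such as ?id=5 are not damaged.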

score -2 · Answer 2 · answered Aug 08 '16 at 13:47

I am assuming based on the code you provided that you are able to get the contents of the email as an html string.

function getHref(content){
  var el = document.createElement('html');
  el.innerHTML = content;

  var hrefs = [];

  var elements = el.getElementsByTagName('a');

  for (var i=0; i < elements.length; i++){
    hrefs.push(elements[i].href);
  }

  return hrefs;
}

This will return an array of all the href attributes from anchor tags on the page.

NTL

The `document` object is not accessible in Google Apps Scripts. That framework does not support all the JS features, only some of them. – Wiktor Stribiżew Aug 08 '16 at 13:50
This only works in browser, client-side. Google apps script is server-side, there is no DOM there at all. – roma Jan 17 '21 at 11:53
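Since there is no DOM in Apps Script, the same idea (reading href values from anchor tags) can be approximated with a regular expression. This is only a sketch under assumptions: the helper name getHrefs is made up, attribute values are assumed to be double-quoted, and the input is assumed to be decoded HTML (for example what GmailMessage.getBody() returns) rather than raw content, where `href=3D"` would not match.

```javascript
// Hypothetical helper: returns the href value of every <a> tag in an HTML string.
// Assumes double-quoted attribute values; this is not a full HTML parser.
function getHrefs(html) {
  var regex = /<a\b[^>]*?\bhref\s*=\s*"([^"]*)"/gi;
  var hrefs = [];
  var match;
  while ((match = regex.exec(html)) !== null) {
    hrefs.push(match[1]); // capture group 1 holds the attribute value
  }
  return hrefs;
}

// Made-up sample input for illustration.
var html = '<p><a class="x" href="http://example.com/one">1</a>' +
           '<img src="http://example.com/i.png">' +
           '<a href="https://example.org/two">2</a></p>';
var hrefs = getHrefs(html);
// hrefs: ["http://example.com/one", "https://example.org/two"]
```

Unlike the URL pattern in the accepted answer, this only picks up links inside anchor tags, so image sources and URLs in plain text are skipped.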
