I have a regex that returns all the links from an HTML file, but it has a problem: instead of returning just the link, like http://link.com, it also returns the leading href=" (href="http://link.com). What can I do to get only the links, without the href=" ?

This is my regex:

/href="(http|https|ftp|ftps)\:\/\/[-a-zA-Z0-9.]+\.[a-zA-Z]{2,3}(?:\/(?:[^"<=]|=)*)?/g

Full code:

  var source = (body || '').toString();
  var urlArray = [];
  var url;
  var matchArray;

  // Regular expression to find FTP, HTTP(S) URLs.
  var regexToken = /href="(http|https|ftp|ftps)\:\/\/[-a-zA-Z0-9.]+\.[a-zA-Z]{2,3}(?:\/(?:[^"<=]|=)*)?/g;

  // Iterate through any URLs in the text.
  while( (matchArray = regexToken.exec( source )) !== null )
  {
    var token = matchArray[0];
    token = JSON.stringify(matchArray[0]);
    token = matchArray[0].toString();
    urlArray.push([ token ]);
  }
Valip
  • Why complicate it that much? `/href="([^"]+)"/g` (if you know the input will always have attribute values in double quotes) – Wiktor Stribiżew Aug 24 '16 at 06:42
  • You should not parse HTML with regex. Use a proper parser. Or [bad things can happen](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). – Amadan Aug 24 '16 at 06:45
  • @WiktorStribiżew I tried this, but it also returns me the email addresses, and I don't want this – Valip Aug 24 '16 at 06:45
  • @Amadan I'm fetching the body content from emails (in HTML format) and the type of response is a string. So I have a string that contains html attributes :) – Valip Aug 24 '16 at 06:47
  • No problem, just add `http`: `/href="(http:\/\/[^"]+)"/g`. Anyway, your regex looks like JS, and in JS, I would rather use DOM to grab all the hrefs and keep those that start with `http`. The regex is not really helpful to do this type of job for arbitrary HTML contents. – Wiktor Stribiżew Aug 24 '16 at 06:51
  • A string with HTML attributes *is HTML*. Use a HTML parser. Python: Beautiful Soup. Ruby: Nokogiri. JavaScript: `DOMParser`. Every common language has one. Otherwise, this could pick up text, not markup, if someone sends, say, an email about how to write HTML. – Amadan Aug 24 '16 at 06:51
  • @WiktorStribiżew I'm using JS but with Google Apps Script which does not support DOM manipulation... – Valip Aug 24 '16 at 06:56
  • That information should be added as a tag to the question. Please post full relevant code. The point is to use a capturing group and grab Group 1 contents via `match[1]` index. You need to use `RegExp#exec` if you have a regex with a global modifier. – Wiktor Stribiżew Aug 24 '16 at 06:57
  • @WiktorStribiżew your solution works exactly like mine, but it still adds that `href="` text before the links...I want to only get the content of the `href` property – Valip Aug 24 '16 at 07:02
  • You have not posted the full code. I mean: your regex and mine are both **FINE**. Your code is **NOT**. – Wiktor Stribiżew Aug 24 '16 at 07:03
  • @WiktorStribiżew I'm using the `RegExp#exec`, I only need to get the content of `href` without adding that `href="` before the result – Valip Aug 24 '16 at 07:04
  • Show the code.. Mind that `RegExp#exec` does not fetch you *all* matches, only one by one. After each match it advances the regex index and at the next iteration, if the result is not null, you get the next match (with all capturing groups). – Wiktor Stribiżew Aug 24 '16 at 07:04
  • Use `var token = matchArray[1];`. Your value is in *Group **1***. Also, you do not need `.toString()` I believe. – Wiktor Stribiżew Aug 24 '16 at 07:07
  • For gathering URLs I would prefer this: document.querySelectorAll("[src],[href]") – Lajos Arpad Aug 24 '16 at 07:40
  • @LajosArpad thank you, but as I mentioned above, google apps script does not support DOM manipulation – Valip Aug 24 '16 at 10:27
  • Pavel, the code I have been providing was reading only from the dom, not writing. Is that also unsupported? – Lajos Arpad Aug 24 '16 at 11:15
  • Yes, this is also unsupported – Valip Aug 24 '16 at 11:41
  • Thanks for letting me know about that, Pavel. – Lajos Arpad Aug 24 '16 at 11:54

1 Answer

RegExp#exec returns an array: index [0] holds the whole match, and the following indices hold the contents captured by the capturing groups defined in your pattern. You can access Group 1 at index [1].

Use

var token = matchArray[1];

Also, I believe you can shorten the regex to just

/\bhref="((?:http|ftp)[^"]+)"/g

if you are sure the values are always inside double quotes. See this demo.
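To make the difference between `matchArray[0]` and `matchArray[1]` concrete, here is a minimal sketch of the loop from the question, rewritten around the shortened pattern. The sample `body` string and the `extractLinks` name are made up for illustration, and it assumes every `href` value is double-quoted. It also shows that with the original pattern from the question, Group 1 wraps only the scheme alternation, so `[1]` there would give just `http`, not the URL.

```javascript
// Hypothetical sample input; in the question, `body` comes from an HTML email.
var body = '<a href="http://link.com/page?x=1">A</a> ' +
           '<a href="mailto:me@example.com">mail</a> ' +
           '<a href="https://example.org">B</a>';

// Hypothetical helper name, not part of the original code.
function extractLinks(html) {
  var source = (html || '').toString();
  var urlArray = [];
  var matchArray;

  // Group 1 wraps the whole URL, so matchArray[1] excludes href=" and the closing quote.
  var regexToken = /\bhref="((?:http|ftp)[^"]+)"/g;

  // exec() returns one match per call and advances lastIndex (because of the g flag).
  while ((matchArray = regexToken.exec(source)) !== null) {
    urlArray.push(matchArray[1]);
  }
  return urlArray;
}

var links = extractLinks(body);
console.log(links); // [ 'http://link.com/page?x=1', 'https://example.org' ]

// With the original pattern, Group 1 is only the scheme alternation.
var original = /href="(http|https|ftp|ftps)\:\/\/[-a-zA-Z0-9.]+\.[a-zA-Z]{2,3}(?:\/(?:[^"<=]|=)*)?/g;
var schemeOnly = original.exec('<a href="http://link.com">')[1];
console.log(schemeOnly); // http
```

The sketch pushes plain strings instead of the one-element arrays (`[ token ]`) used in the question; keep the wrapping if the rest of your script expects it. The `mailto:` link is skipped because the pattern only accepts values that start with `http` or `ftp`.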

Wiktor Stribiżew
  • I modified the code, and with your help the result is improved, but still has a problem...now the links have `"` before (like this: `"https://link.com`) – Valip Aug 24 '16 at 07:13
  • That is not possible, just log the `matchArray[1]` value. You get it inside quotes because you `JSON.stringify` it. – Wiktor Stribiżew Aug 24 '16 at 07:16
  • You're right, the `JSON.stringify` messed up the things, now everything works! – Valip Aug 24 '16 at 07:18
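The stray `"` described in the comments above can be reproduced in isolation. This is a small sketch, using a made-up URL, of what `JSON.stringify` does to a plain string: it returns the JSON representation, which includes the surrounding double quotes as real characters.

```javascript
// Hypothetical value standing in for matchArray[1].
var url = 'https://link.com';

// JSON.stringify on a string returns its JSON form, including the quotes.
var stringified = JSON.stringify(url);

console.log(url.length);                    // 16
console.log(stringified.length);            // 18
console.log(stringified.charAt(0) === '"'); // true

// A captured group is already a string, so it can be used as-is.
var token = url;
console.log(token === 'https://link.com');  // true
```

So if the captured value is passed through `JSON.stringify` before being stored, every link ends up wrapped in quotes; dropping that call (and the redundant `.toString()`) leaves the bare URL.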