I have been trying to find a way to rewrite every href that does not start with http:// or https:// so that the link gets a protocol prepended (https:// in the expected output below).

Currently I have something like this:

static correctUrls(input: string): string {

  // get all hrefs from the input
  let urls = input.match('<a[^>]* href="([^"]*)"/g');

  // if no urls return original input
  if (!urls) {
    return input;
  }

  // remove duplicate urls
  urls = urls.filter((item, pos) => {
    return urls.indexOf(item) === pos;
  });

  // if no urls in input
  if (!urls) {
    return input;
  }

  for (const url of urls) {

    // if url does not have https
    // tslint:disable-next-line: max-line-length
    if (!url.match('^ (http: \/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$')) {
      input = input.replace(url, 'https://' + url);
    }
  }
  return input;
}

Any help would be greatly appreciated. Please include an explanation of how your answer's regex works. I have found lots of similar questions, but with every solution I have tried, input.match returns the matched href twice when there is one href, and returns rubbish when there are two.

Here is the input:

<p> We love
  <a href="https://google.com"
     rel="noopener noreferrer"
     target="_blank">Google</a>
  and
  <a href="Facebook.com"
     rel="noopener noreferrer"
     target="_blank">Facebook</a>.
</p>

And the expected output:

<p> We love
  <a href="https://google.com"
     rel="noopener noreferrer"
     target="_blank">Google</a>
  and
  <a href="https://Facebook.com"
     rel="noopener noreferrer"
     target="_blank">Facebook</a>.
</p>
Nick Gallimore
  • Don't use regex to parse out HTML. Use the DOM to find the anchor tags and their `href` attributes and the `URL` class to parse them. – Oct 11 '19 at 20:16
  • I'm using Angular, gonna try creating a new HtmlElement() and setting the .innerHtml to the input and navigating the DOM that way. – Nick Gallimore Oct 11 '19 at 20:35
  • If you have the HTML as a string, you can parse it using the DOM without actually adding it to the page. Use DOMParser instead: https://developer.mozilla.org/en-US/docs/Web/API/DOMParser –  Oct 11 '19 at 20:36
  • @Amy Thank you that's what helped me. – Nick Gallimore Oct 11 '19 at 21:24

2 Answers

The correct way of doing this in Angular is to use DOMParser: parse the input string into a document, select all anchor elements, and then apply the regex to each href to check whether it already starts with http:// or https://.

export class UrlCorrector {
  static correctUrls(input: string): string {

    const parser = new DOMParser();
    const document = parser.parseFromString(input, 'text/html');

    // get all anchor tags from the input
    const anchorTags = document.getElementsByTagName('a');

    // if no anchor tags return original input
    if (anchorTags.length === 0) {
      return input;
    }

    const urls: string[] = [];

    // iterate through all the anchor tags to find their urls
    // tslint:disable-next-line: prefer-for-of
    for (let i = 0; i < anchorTags.length; i++) {

      const href = anchorTags[i].href;
      let url = href;

      // if url has hostname in it, it's a href without http protocol
      if (href.includes(location.hostname)) {

        // get just the ending part e.g., `localhost:4200/submissions/facebook.com` will return `facebook.com`
        url = href.substr(href.lastIndexOf('/') + 1);
      }
      urls.push(url);
    }

    for (const url of urls) {

      // if url does not have a protocol, prepend https:// to it
      // tslint:disable-next-line: max-line-length
      if (!url.match(/^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$/)) {
        input = input.replace(url, 'https://' + url);
      }
    }
    return input;
  }
}
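
The `href` property used above returns the URL resolved against the current page, which is why the hostname has to be stripped again. A minimal sketch of a variant that reads the raw attribute with `getAttribute` and writes the fix back through the DOM is shown here. The names `hasHttpProtocol` and `correctUrlsViaDom` are hypothetical, the sketch assumes a browser environment where `DOMParser` exists, and the serialized markup it returns may differ from the input in whitespace or quoting.

```typescript
// Hypothetical helper: true when the raw href value already starts with http:// or https://
function hasHttpProtocol(rawHref: string): boolean {
  return /^https?:\/\//i.test(rawHref);
}

// Sketch only: rewrite hrefs through the DOM instead of string replacement.
// Assumes a browser environment that provides DOMParser.
function correctUrlsViaDom(input: string): string {
  const doc = new DOMParser().parseFromString(input, 'text/html');
  const anchors = Array.from(doc.querySelectorAll('a[href]'));

  for (const anchor of anchors) {
    // getAttribute returns the value as written, not resolved against the page URL
    const raw = anchor.getAttribute('href') || '';
    if (raw !== '' && !hasHttpProtocol(raw)) {
      anchor.setAttribute('href', 'https://' + raw);
    }
  }

  // The parser wraps fragments in <html><body>, so return only the body's markup
  return doc.body.innerHTML;
}
```

Note that this simple check would also prefix values such as `mailto:` links, `#fragment` links, or relative paths, so those cases would need their own handling if they can occur in the input.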
Nick Gallimore

Regex is the wrong tool for the job. You're already in JavaScript, which has an abundance of tools for DOM management, many of which do exactly what you want. Please try to use these instead; they're much more applicable to your task!

If you really want to do it with regex, href="(?!https?:\/\/)()[^"]+" should do the job.

  • href=" Look for href=" string to start the match
  • (?!https?:\/\/) Assert there's no http:// or https:// at the start of the URL
  • () Empty capture at the start of the URL you want to edit - insert your string here
  • [^"]+" Match content up to the next quote mark; this is the rest of the URL
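
As a rough illustration of how the pieces above fit together, the sketch below applies the same pattern with a `g` flag and `String.prototype.replace`, so every matching href is rewritten in one pass. The empty capture group is dropped and the rest of the URL is captured as `$1` instead. The function name `prependHttps` and the sample string are assumptions made for this example.

```typescript
// Hypothetical helper: prepend https:// to every href that lacks http:// or https://
function prependHttps(html: string): string {
  // (?!https?:\/\/) rejects hrefs that already have a protocol;
  // ([^"]+) captures the rest of the URL so it can be reused as $1
  return html.replace(/href="(?!https?:\/\/)([^"]+)"/g, 'href="https://$1"');
}

const sample =
  '<a href="https://google.com">Google</a> and <a href="Facebook.com">Facebook</a>';

console.log(prependHttps(sample));
// prints: <a href="https://google.com">Google</a> and <a href="https://Facebook.com">Facebook</a>
```

The usual caveat still applies: the pattern only sees double-quoted `href="..."` text, so it would miss single-quoted attributes and would also prefix values like `mailto:` links or relative paths.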

A sample JavaScript program using this method:

// Sample input: two links with a protocol and two without
var x = '<p> We love <a href="https://google.com" rel="noopener noreferrer" target="_blank">Google</a> and <a href="Facebook.com" rel="noopener noreferrer" target="_blank">Facebook</a>. <a href="www.example.com" rel="noopener noreferrer" target="_blank">Facebook</a>. <a href="http://www.example.com" rel="noopener noreferrer" target="_blank">Facebook</a>. </p>';

// Without the g flag, match() returns the first match followed by its capture groups
var urls = x.match(/href="(?!https?:\/\/)()([^"]+)"/);

// urls[1] is the empty capture; urls[2] is the URL itself
console.log("https://" + urls[2]);
// prints: https://Facebook.com
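
The sample above only reports the first href without a protocol. If the goal is to list every href value first and classify it in a second step, one possible sketch is a global pattern driven by `exec()` in a loop. The helper name `extractHrefs` and the input string are made up for illustration, and a DOM parser remains the more robust choice.

```typescript
// Hypothetical helper: collect every double-quoted href value in document order
function extractHrefs(html: string): string[] {
  const pattern = /href="([^"]*)"/g;
  const found: string[] = [];
  let match: RegExpExecArray | null;

  // With the g flag, each exec() call resumes after the previous match
  while ((match = pattern.exec(html)) !== null) {
    found.push(match[1]);
  }
  return found;
}

const allHrefs = extractHrefs(
  '<a href="https://google.com">Google</a> <a href="Facebook.com">Facebook</a>'
);

// Second step: keep only the values that lack http:// or https://
const missingProtocol = allHrefs.filter((href) => !/^https?:\/\//i.test(href));

console.log(allHrefs);
console.log(missingProtocol);
```

Here `allHrefs` holds both URLs and `missingProtocol` holds only `Facebook.com`.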

Nick Reed
  • Using expected input and .match result was ["href="Facebook.com"", ""] – Nick Gallimore Oct 11 '19 at 20:40
  • I'm not sure I understand the comment. Could you please clarify? – Nick Reed Oct 11 '19 at 20:40
  • But I am going to use the DOM instead. – Nick Gallimore Oct 11 '19 at 20:43
  • Sounds good. Please be sure to accept the answer if it addresses your question, OR flag a moderator to close the question (without deleting) so future users can reference it. – Nick Reed Oct 11 '19 at 20:44
  • Using chrome I get the following value for urls when executing your Javascript code: ["href="Facebook.com", "", "Facebook.com"] – Nick Gallimore Oct 11 '19 at 20:53
  • That's expected behavior - you can get just "Facebook.com" with `urls[2]`, since `urls` is an array returned by `x.match()`. You can also remove the empty capture group and reference `urls[1]` instead if you would like. – Nick Reed Oct 11 '19 at 20:56
  • Is it possible for the first regex to return ['https://google.com', 'Facebook.com'], simply just getting the hrefs? Then I was planning on using a second regex to determine if it had a protocol in front of it or not. – Nick Gallimore Oct 11 '19 at 21:02
  • That's more complicated, and again, better suited to a DOM parser. Please consider using code similar to the link in the answer. – Nick Reed Oct 11 '19 at 21:04
  • Sorry then this is not an answer – Nick Gallimore Nov 06 '19 at 14:25