I'm creating a small tool that parses HTML and replaces the links. How can I achieve this using a regex?

My previous approach was a regex for the links that don't start with http, followed by some small code operations, but the regex doesn't seem to work. (I now realize that I accidentally made a capture group for the condition rather than just for context, but I'm still stuck.)

I essentially want to extract each URL and rewrite it as https://example.org/?url=<url>

Examples:

(with base URL example.com)

<script src="scripts/something.js"/>
<img href="https://cdn.some_website.com/foo.jpg"/>
<a href="/hello">Hello world!</a>

turns into this:

<script src="https://example.org/?url=https://example.com/scripts/something.js"/>
<img href="https://example.org/?url=https://cdn.some_website.com/foo.jpg"/>
<a href="https://example.org/?url=https://example.com/hello">Hello world!</a>

I would prefer for this to be all regex if possible, but I might have to use code for some of the more complicated parts of this. Does anyone know of a regex that would work here?

tadman

1 Answer

You'll need to do this in steps:

  1. Replace relative paths with absolute ones:
    (src|href)="(?!\/|http|ftp|#|mailto:|tel:|javascript:) -> $1="<relative_path_prefix>
    Example replacements: $1="/ or $1="/root_folder/next_folder/. Finding the correct relative prefix is outside the scope of this question; you'll have to determine it while iterating through pages.
  2. Replace absolute paths with fully qualified ones:
    (src|href)="(?=\/) -> $1="<current_domain>
    In your example: $1="https://example.com
  3. Replace any link with desired:
    (src|href)="(?!mailto:|tel:|javascript:|#)(.*?)" -> $1="https://example.org/?url=$2"

After this, links to page elements (like href='#header'), e-mail links, phone links, and JavaScript triggers will stay in place, while all other links will be replaced as you described.
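
The three steps can be chained in code. Below is a minimal Python sketch, not a definitive implementation. The function name `rewrite_links` and its parameters `relative_prefix`, `domain`, and `proxy_prefix` are made up for illustration. It assumes double-quoted attributes only, and the step 1 lookahead here skips `mailto:`, `tel:`, and `javascript:` so that those links are not given a path prefix.

```python
import re


def rewrite_links(html, relative_prefix, domain, proxy_prefix):
    """Apply the three replacement steps in order (hypothetical helper)."""
    # Step 1: relative paths -> absolute paths.
    # Skip anything that is already absolute, has a scheme, or is an anchor.
    html = re.sub(
        r'(src|href)="(?!/|http|ftp|#|mailto:|tel:|javascript:)',
        lambda m: f'{m.group(1)}="{relative_prefix}',
        html,
    )
    # Step 2: absolute paths (starting with "/") -> fully qualified URLs.
    html = re.sub(
        r'(src|href)="(?=/)',
        lambda m: f'{m.group(1)}="{domain}',
        html,
    )
    # Step 3: wrap every remaining link in the proxy URL, leaving
    # e-mail, phone, JavaScript, and in-page anchor links untouched.
    html = re.sub(
        r'(src|href)="(?!mailto:|tel:|javascript:|#)(.*?)"',
        lambda m: f'{m.group(1)}="{proxy_prefix}{m.group(2)}"',
        html,
    )
    return html


page = (
    '<script src="scripts/something.js"/>\n'
    '<img href="https://cdn.some_website.com/foo.jpg"/>\n'
    '<a href="/hello">Hello world!</a>\n'
    '<a href="#top">Top</a>'
)
print(rewrite_links(page, "/", "https://example.com", "https://example.org/?url="))
```

The replacements are passed as functions rather than template strings so that backslashes in a prefix are not interpreted as escapes. The wrapped URL is inserted verbatim; if the target URLs can contain `&` or `#`, they would likely need percent-encoding first (for example with `urllib.parse.quote`).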

markalex