I wrote this regex to capture every kind of URL (and it does match all of them), but it also matches bare IP addresses.

This is my scenario: I have a list of IPs, hashes, and URLs, and my URL regex and my IP regex both match the same entries. I don't know whether a bare IP counts as a URL.

My regex: ((http|https)://)?(www)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,9}\b([-a-zA-Z0-9()@:%_\|+.~#?&//={};,\[\]'"$\x60]*)?

Captures all these:

http://127.0.0.1/
http://127.0.0.1
https://127.0.0.1/m=weblogin/loginform238,363,771,89816356,2167
127.0.0.1:8080 ------> excluding this one is okay too (optional)
127.0.0.1 ------> I want to exclude this one
google.com
google.com:80
www.google.com
https://google.com
https://www.google.com

I want my regex to capture all URLs except bare IPs like this:

127.0.0.1
  • Note: I want to use this in Go code (with Go's regexp engine).
  • Note: I am using the regexp.Compile() and FindAllString functions.

Try this regex on regex101.

  • The simplest way is to filter out those strings from the result that match [IP regex](https://stackoverflow.com/a/30023010/3832970). – Wiktor Stribiżew Oct 01 '20 at 20:43
  • Another way is to use [this regex](https://regex101.com/r/JnVAwQ/1) with [`FindAllStringSubmatch`](https://golang.org/pkg/regexp/#Regexp.FindAllStringSubmatch) and only keep the Group 1 values. – Wiktor Stribiżew Oct 01 '20 at 20:50
  • @WiktorStribiżew It gets tricky because filtering with a `ip regex` might match `http://127.0.0.1` as well. However it seems like a solution worth trying, tnx. – JoshTheSideDev Oct 01 '20 at 20:52
  • Well, extend the IP matching part, see https://regex101.com/r/JnVAwQ/2 – Wiktor Stribiżew Oct 01 '20 at 20:53
  • @WiktorStribiżew Im working on your 2nd comment. Can u post a answer so i can accept it? I cannot even upvote your comment LOL – JoshTheSideDev Oct 01 '20 at 21:04

1 Answer

You can use a regex implementing the "best trick ever" with FindAllStringSubmatch: match what you need to skip/omit, and match and capture what you need to keep.

\b(?:https?://)?(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b(?:[^:]|$)|((?:https?://)?(?:www)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,9}\b[-a-zA-Z0-9()@:%_\|+.~#?&//={};,\[\]'"$\x60]*)

The first alternative is an IP-matching regex to which I added (?:https?://)? to match an optional protocol, and (?:[^:]|$) to require either a char other than : or the end of the string immediately after the IP pattern. This alternative has no capturing group, so Group 1 is empty whenever it matches and those matches can be discarded. You may adjust this part further.

Then, use it in Go like this:

package main

import (
    "fmt"
    "regexp"
)

func main() {
    r := regexp.MustCompile(`\b(?:https?://)?(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b(?:[^:]|$)|((?:https?://)?(?:www)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,9}\b[-a-zA-Z0-9()@:%_\|+.~#?&//={};,\[\]'"$\x60]*)`)
    matches := r.FindAllStringSubmatch(`http://127.0.0.1/
http://127.0.0.1
http://www.127.0.0.1/m=weblogin/loginform238,363,771,89816356,2167
127.0.0.1:8080
127.0.0.1
google.com
google.com:80
www.google.com
https://google.com
https://www.google.com`, -1)
    for _, v := range matches {
        if len(v[1]) > 0 { // if Group 1 matched
            fmt.Println(v[1]) // display it, else do nothing
        }
    }
}

Output:

http://www.127.0.0.1/m=weblogin/loginform238,363,771,89816356,2167
127.0.0.1:8080
google.com
google.com:80
www.google.com
https://google.com
https://www.google.com
Wiktor Stribiżew