The problem I am facing is mostly one of logic that I cannot work out; it concerns regex and coding.

This is the pattern I use to extract links from a document:

http(s)?://([\w+?\.\w+])+([a-zA-Z0-9\~\!\^\&\*\(\)_\-\=\+\\\?\/\.\:\;\'\,]*)?

It took me a while to put together, but it works well and extracts links from the whole document. My issue is that if two links are joined together with nothing between them, it extracts them as a single match.

I tried placing "http" at the end of the regex pattern to end the match there, but that didn't work. For example, the following two links show up as one single match (they appear like this in the original document):

http://www.preemptive.com/dotfuscator/dtd/dotfuscatorMap_v1.0.dtd/dotfuscatorMap_v1.0.dtdhttp://www.preemptive.com/dotfuscator/dtd/dotfuscatorMap_v1.1.dtd/dotfuscatorMap_v1.1.dtd

Here is the regex code, if you want to take a look:

Dim regexFunc As New Regex("http(s)?://([\w+?\.\w+])+([a-zA-Z0-9\~\!\^\&\*\(\)_\-\=\+\\\?\/\.\:\;\'\,]*)?", RegexOptions.IgnoreCase)
Dim matches As MatchCollection = regexFunc.Matches(_dataLoaded.ToString)

For Each x As Match In matches
    ' A match has been found; it can contain one or more connected links.
Next

Question: when a match contains multiple links, how can I separate them so that I can store each one in, say, an array? Thanks.

Karizan
  • Try `"https?://\w+(?:\.\w+)+(?:(?!https?://)[a-zA-Z0-9~!^&*()_=+\\?/.:;',-])*"`, see https://regex101.com/r/ihSKvA/2 (do not copy/paste the pattern from this comment, there are garbage chars after `()`) – Wiktor Stribiżew Jan 21 '17 at 15:37
  • Nice quantifier usage... @Wiktor Stribiżew – Trevor Jan 21 '17 at 16:13
  • It works pretty well actually. Made a few changes here and there to fit my needs, but overall it does the job. Thanks a lot for the website too @WiktorStribiżew – Karizan Jan 21 '17 at 22:52

1 Answer

You may temper the greedily quantified character class with a negative lookahead (a so-called tempered greedy token):

https?://\w+(?:\.\w+)+(?:(?!https?://)[a-zA-Z0-9~!^&*()_=+\\?/.:;',-])*
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

See the regex demo (unnecessary groups and escapes are removed).

Note that [\w+?\.\w+] is a character class (due to the unescaped square brackets) that matches a single char that is either a word char, +, ? or a dot; with the + quantifier on the enclosing group, it matches 1+ such chars. So, I suggest rewriting it as \w+(?:\.\w+)+ (adjust according to your requirements).
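
To see the difference between the two constructs, here is a minimal sketch. The module name, the sample strings (including the example.com host) and the added ^ and $ anchors are assumptions made only for this illustration; they are not part of the original pattern.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module CharClassSketch
    Sub Main()
        ' Original construct: one character class, repeated by the outer +.
        ' Inside the brackets, + and ? are literal characters.
        Dim original As New Regex("^([\w+?\.\w+])+$")
        ' Suggested rewrite: word chars followed by 1+ ".word" sequences.
        Dim rewritten As New Regex("^\w+(?:\.\w+)+$")

        ' Hypothetical inputs: a host name and two strings that are not host names.
        For Each candidate As String In {"www.example.com", "a+?..b", "..."}
            Console.WriteLine("{0}: original={1}, rewritten={2}",
                              candidate,
                              original.IsMatch(candidate),
                              rewritten.IsMatch(candidate))
        Next
    End Sub
End Module
```

The character-class version should accept all three strings, since every char in them is a word char, +, ? or a dot, while the rewritten version should accept only the first.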

If the regex is stored in some sort of XML, the &amp; is OK; otherwise, just replace it with &.

Details:

  • https?:// - http:// or https://
  • \w+ - 1+ word chars
  • (?:\.\w+)+ - 1+ sequences of a dot and 1+ word chars
  • (?:(?!https?://)[a-zA-Z0-9~!^&*()_=+\\?/.:;',-])* - a tempered greedy token matching any char defined in the character class that does not start a http:// or https:// character sequence.
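
As a rough sketch of how this could be used to store each link separately, the snippet below runs the pattern against the two joined links from the question and collects every match into an array. The module name, the ExtractLinks helper and the LinkPattern constant are hypothetical names introduced for this example.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module LinkExtractorSketch
    ' Pattern from the answer above. VB.NET string literals do not treat
    ' the backslash as an escape char, so it can be written as-is.
    Private Const LinkPattern As String = "https?://\w+(?:\.\w+)+(?:(?!https?://)[a-zA-Z0-9~!^&*()_=+\\?/.:;',-])*"

    ' Hypothetical helper: returns every link as its own array element.
    Function ExtractLinks(text As String) As String()
        Dim links As New List(Of String)
        For Each m As Match In Regex.Matches(text, LinkPattern, RegexOptions.IgnoreCase)
            links.Add(m.Value)
        Next
        Return links.ToArray()
    End Function

    Sub Main()
        ' The two joined links from the question.
        Dim sample As String = "http://www.preemptive.com/dotfuscator/dtd/dotfuscatorMap_v1.0.dtd/dotfuscatorMap_v1.0.dtdhttp://www.preemptive.com/dotfuscator/dtd/dotfuscatorMap_v1.1.dtd/dotfuscatorMap_v1.1.dtd"
        For Each link As String In ExtractLinks(sample)
            Console.WriteLine(link)
        Next
    End Sub
End Module
```

Because the lookahead stops the tempered token right before the second http://, the sample should yield two array elements, one per link, instead of a single combined match.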
Wiktor Stribiżew