I have a working regex that I use to list all the links found in a given piece of HTML:

<a\s[^>]*href=(\"??)(http[^\" >]*?)\\1[^>]*>(.*)<\/a>

This works pretty well. The problem is that I now want to exclude all internal links from the results. At first glance it would be enough to keep only the links that include "http", but unfortunately there are plenty of internal "absolute" links.

I already know the website URL and don't need help getting it, so let's just assume it is www.test.com / test.com.

I had a look at the negative lookahead reference, but I'm not sure how it should be implemented in the existing regex.

Thanks, cheers

giovanni
  • [Do not parse HTML with Regex](https://stackoverflow.com/a/1732454/5827005) – GrumpyCrouton Dec 03 '18 at 19:28
  • I'll give you a regex, give me a minute –  Dec 03 '18 at 19:58

1 Answer

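The easiest way is to create a blacklist of sites using an alternation,
in combination with (*SKIP)(*FAIL), a pair of PCRE backtracking-control verbs.
When the first branch matches a blacklisted URL, the engine discards that match,
moves past the offending anchor, and cannot backtrack into it.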

(?:<a(?=\s)(?=(?:[^>"']|"[^"]*"|'[^']*')*?\shref\s*=\s*(?:(['"])(?:(?!\1)[\S\s])*?(?:www\.test\.com|test\.com)(?:(?!\1)[\S\s])*?\1))\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>.*?</a\s*>(*SKIP)(*FAIL)|<a(?=\s)(?=(?:[^>"']|"[^"]*"|'[^']*')*?\shref\s*=\s*(?:(['"])([\S\s]*?)\2))\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>(.*?)</a\s*>)

https://regex101.com/r/hpwUr3/1

The groups you want are:
- Group 3 = url
- Group 4 = content
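
Python's built-in re module has no (*SKIP)(*FAIL), so here is only a rough sketch of the same idea in that setting: put the offender branch first, keep the groups you want in the second branch only, and ignore every match where group 3 is empty. The pattern is a simplified stand-in for the full regex above (it assumes quoted href values and the example hosts from the question), and external_links is a hypothetical helper name.

```python
import re

LINK_RE = re.compile(
    r"""
    # Branch 1, offender: quoted href containing a blacklisted host.
    <a\s[^>]*?\bhref\s*=\s*
    (["'])                                # (1) quote
    (?:(?!\1).)*?
    (?:www\.test\.com|test\.com)          # add more offenders here
    (?:(?!\1).)*?
    \1[^>]*>.*?</a\s*>
  |
    # Branch 2, good anchor.
    <a\s[^>]*?\bhref\s*=\s*
    (["'])                                # (2) quote
    (.*?)                                 # (3) url
    \2[^>]*>
    (.*?)                                 # (4) content
    </a\s*>
    """,
    re.VERBOSE | re.DOTALL | re.IGNORECASE,
)


def external_links(html):
    """Return (url, content) pairs for anchors that are not blacklisted."""
    results = []
    for m in LINK_RE.finditer(html):
        if m.group(3) is None:
            # The offender branch matched: the anchor is consumed and dropped,
            # which stands in for (*SKIP)(*FAIL).
            continue
        results.append((m.group(3), m.group(4)))
    return results


sample = (
    '<a href="http://www.test.com/a">in</a> '
    '<a class="x" href="https://example.org/">out</a> '
    "<a href='http://test.com/b'>in2</a>"
)
print(external_links(sample))  # [('https://example.org/', 'out')]
```

Like the regex above, this treats any href that contains test.com anywhere as internal, so a URL such as http://mytest.com/ would be dropped too.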

Explained

 (?:
      # Begin Offender Anchor tag
      < a
      (?= \s )
      (?=                           # Assertion for:  href  (a pseudo atomic group)
           (?: [^>"'] | " [^"]* " | ' [^']* ' )*?
           \s href \s* = \s* 
           (?:
                ( ['"] )                      # (1)
                (?:
                     (?! \1 )
                     [\S\s] 
                )*?
                (?:                           # Add more offenders here
                     www \. test \. com
                  |  test \. com 
                )
                (?:
                     (?! \1 )
                     [\S\s] 
                )*?
                \1 
           )
      )
                                    # Have the href offender, just match the rest of tag
      \s+ 
      (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+

      >                             # End  tag

      .*? 
      </a \s* >
      (*SKIP) (*FAIL)               # Move past the offender
   |  

      # Begin Good Anchor tag
      < a
      (?= \s )
      (?=                           # Assertion for:  href  (a pseudo atomic group)
           (?: [^>"'] | " [^"]* " | ' [^']* ' )*?
           \s href \s* = \s* 
           (?:
                ( ['"] )                      # (2)
                ( [\S\s]*? )                  # (3), Good link
                \2 
           )
      )
                                    # Have the href good one, just match the rest of tag
      \s+ 
      (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+

      >                             # End  tag

      ( .*? )                       # (4), Content
      </a \s* >
 )
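
For comparison, the negative lookahead the question asks about can be sketched directly against the question's original, simpler pattern. This is an illustration rather than a replacement for the regex above: it is written for Python's re module, the lookahead sits right before the URL group, the host check (?![\w.-]) is my own addition so that hosts like test.community are not treated as internal, and the greedy (.*) is made lazy so one match does not swallow several anchors. The name external_links_lookahead is hypothetical.

```python
import re

EXTERNAL_LINK = re.compile(
    r'<a\s[^>]*href=("??)'                          # (1) optional quote
    r'(?!https?://(?:www\.)?test\.com(?![\w.-]))'   # reject internal hosts
    r'(http[^" >]*?)\1[^>]*>(.*?)</a>',             # (2) url, (3) content
    re.IGNORECASE | re.DOTALL,
)


def external_links_lookahead(html):
    """Return (url, content) pairs for links that do not point at test.com."""
    return [(m.group(2), m.group(3)) for m in EXTERNAL_LINK.finditer(html)]


sample = (
    '<a href="http://www.test.com/page">home</a>'
    '<a href="http://test.com">root</a>'
    '<a href="https://other.net/x" class="c">ext</a>'
)
print(external_links_lookahead(sample))  # [('https://other.net/x', 'ext')]
```

The lookahead only inspects the start of the URL, so unlike the blacklist approach it keeps external links that merely mention test.com in their path or query string.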