
I need to parse links from HTML, but only those that are not followed by 'class="mw-disambig"'. I wrote this regexp:

r'<a href="(.+?)"(?! class="mw-disambig")'

but it still matches something like this:

  'https://ru.wikipedia.org/wiki/Тюльпан_(значения)" class='

Original HTML:

<a href="here was link" class="mw-disambig" title="Тюльпан"...>

This link shouldn't be matched, or am I misunderstanding something?

What am I doing wrong?

Arzybek
  • Please don't do this in regex... [HE COMES](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags); use BeautifulSoup. – ctwheels Mar 27 '18 at 18:28
  • @Ben I'm studying, so I need to know... – Arzybek Mar 27 '18 at 18:31
  • @Arzybek can you be a little bit clearer about what you expect here? If you want to match URLs that _don't_ have `class="mw-disambig"` in them, then matching the example you provided is the correct behavior. – ethan.roday Mar 27 '18 at 18:35
  • @err1100 Yes, but I don't want this link to be included at all. – Arzybek Mar 27 '18 at 18:38
  • Generally, I like to use https://www.debuggex.com/ for regexes. – Ben Mar 27 '18 at 18:38
  • Ah, I think I understand. Are you using `re.match()` or `re.search()`? If you only want to match from the beginning of the string, use `match()`. – ethan.roday Mar 27 '18 at 18:43
  • @err1100 look at the edited post now, I added something to explain – Arzybek Mar 27 '18 at 18:47
  • How are you using the regex? Can you please post a [complete code example](https://stackoverflow.com/help/mcve) where you get an unexpected result, along with what you expect the result to be? – ethan.roday Mar 27 '18 at 19:26

1 Answer


".*?" doesn't mean "match precisely the shortest sequence contained in quotes". It means "match the shortest sequence contained in quotes for which the following characters match the rest of the pattern".

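So when the negative lookahead rejects the shortest match, the regex engine backtracks and tries the next longer one, which ends at the quote following class=. At that position the text that follows is no longer ' class="mw-disambig"', so the negative lookahead does not reject it and the match succeeds.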
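To see the backtracking happen, here is a minimal sketch using Python's `re` module. The URL and the `html` string are made-up stand-ins for the real page, and the variable names are only illustrative.

```python
import re

# Hypothetical sample tag; the URL is a placeholder, not the real page.
html = '<a href="https://example.org/a" class="mw-disambig" title="x">'

# The pattern from the question: lazy group plus a negative lookahead.
lazy = re.compile(r'<a href="(.+?)"(?! class="mw-disambig")')

m = lazy.search(html)
captured = m.group(1)

# The lookahead rejects the first closing quote, so the lazy group is
# extended until the next quote (the one right after class=).
print(captured)  # https://example.org/a" class=
```

The lazy quantifier only prefers the shortest match; it is still free to grow past a quote whenever the rest of the pattern fails.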

If you just want to match a quoted string which does not contain quotes, be explicit:

"[^"]*"

(match a quote, any number of characters other than a quote, and the closing quote).
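Applied to the pattern from the question, this could look like the sketch below. It assumes the `class` attribute comes immediately after `href`, separated by a single space, as in the example tag; the sample HTML and URLs are invented for illustration.

```python
import re

# Hypothetical input: one ordinary link and one disambiguation link.
html = (
    '<a href="https://example.org/plain" title="ok">'
    '<a href="https://example.org/disambig" class="mw-disambig" title="x">'
)

# [^"]+ cannot cross a quote, so the group can only end at the quote
# that really closes the href value. If the lookahead rejects that
# position, there is no longer alternative to fall back on.
strict = re.compile(r'<a href="([^"]+)"(?! class="mw-disambig")')

links = strict.findall(html)
print(links)  # ['https://example.org/plain']
```

This is still tied to one exact attribute order and spacing, so for real pages an HTML parser such as BeautifulSoup (as suggested in the comments) is likely to be more robust.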

rici