How to look behind in regex without matching the pattern itself?

Question

Let's say we want to extract the link from a tag like this:

input:

<p><a href="http://www.google.com/home/etc"><b>some text</b></a></p>

desired output:

http://www.google.com/home/etc

The first solution is to use a capturing group, with a regex like href=[\'"]?([^\'" >]+), and read the link from the group. What I want instead is for the match itself to be only the link that follows href, without href being part of it. I tried a lookahead assertion, (?=href\")... (which matches without consuming), but the match still includes href itself.

It is a regex only question.

[Have you tried using an HTML parser instead?](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — jonrsharpe, Oct 15 '17 at 10:21
If we put aside the issue of parsing HTML with regex, your regex works fine (for your example, at least). But the output depends on the exact function you use. For example, try with `re.findall()`. — randomir, Oct 15 '17 at 10:25
I'm really not quite sure what you're asking, but I have a hunch that you're looking for a look*behind*: `(?<=href=['"])[^'" >]+` — Aran-Fey, Oct 15 '17 at 10:32
this is the exact solution I was looking for. thanks. I have to edit the question. — DragonKnight, Oct 15 '17 at 10:34

score 2 · Answer 1 · answered Oct 15 '17 at 10:31

One of many regex-based solutions would be a capturing group:

>>> import re
>>> s = '<p><a href="http://www.google.com/home/etc"><b>some text</b></a></p>'
>>> re.search(r'href="([^"]*)"', s).group(1)
'http://www.google.com/home/etc'

[^"]* matches any number of characters other than ".
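
As a rough sketch of how the capturing group compares with the lookbehind suggested in the comments, the snippet below runs both against the question's sample tag, plus a lookahead to show why that attempt still returns href. The variable names are illustrative only. One thing to keep in mind: Python's `re` module only accepts fixed-width lookbehinds, which `(?<=href=['"])` happens to satisfy.

```python
import re

s = '<p><a href="http://www.google.com/home/etc"><b>some text</b></a></p>'

# Capturing group: the whole match still contains href="...",
# only group(1) holds the bare URL.
m = re.search(r'href="([^"]*)"', s)
whole_match = m.group(0)
url_from_group = m.group(1)

# With exactly one capturing group, findall returns the group contents only.
urls_findall = re.findall(r'href="([^"]*)"', s)

# Lookahead: asserts what FOLLOWS the current position, so the match
# starts at the "h" of href and href ends up inside the match.
lookahead_match = re.search(r'(?=href=")[^>]+', s).group(0)

# Lookbehind: href=" (or href=') must PRECEDE the match but is not part
# of it, so the whole match (group 0) is already the bare URL.
url_from_lookbehind = re.search(r'''(?<=href=['"])[^'" >]+''', s).group(0)

print(whole_match)
print(url_from_group)
print(urls_findall)
print(lookahead_match)
print(url_from_lookbehind)
```

Which variant to prefer probably depends on the calling function: with `re.search()` the capturing group needs `.group(1)`, while the lookbehind version can be used wherever only the whole match is available.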

user2390182


score 1 · Answer 2 · answered Oct 15 '17 at 12:04

A solution could be:

(?:href=)('|")(.*)\1

(?:href=) is a non-capturing group. The regex engine uses href= while matching but does not store it in a numbered group, so if you inspect the groups you will see that none of them holds it. It is still part of the overall match, though.

Besides, every time you open and close a plain round bracket, you create a capturing group. As a consequence, ('|") defines group #1 and the URL you want will be in group #2. The way you retrieve this information depends on the programming language.

Finally, \1 is a backreference: it matches the same text that group #1 captured (in this case "), which provides the closing delimiter for the URL.
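
To make the group numbering and the backreference concrete, here is a minimal sketch in Python (the language choice and the `example.com` inputs are assumptions for illustration, not part of the answer). It also shows one caveat: the greedy `.*` runs to the last matching quote on the line, so a lazy `.*?` may be safer when the tag carries more than one quoted attribute.

```python
import re

# The pattern from the answer: group 1 = quote character, group 2 = URL.
pattern = re.compile(r'''(?:href=)('|")(.*)\1''')

html = '<p><a href="http://www.google.com/home/etc"><b>some text</b></a></p>'
m = pattern.search(html)
overall = m.group(0)  # still includes href=, despite the non-capturing group
quote = m.group(1)    # the opening quote captured by group #1
url = m.group(2)      # the text between the quotes

# Single-quoted attribute (hypothetical input): \1 must now match '
url_single = pattern.search("<a href='http://example.com/a'>x</a>").group(2)

# Caveat (hypothetical input): with a second quoted attribute,
# greedy .* over-matches, while lazy .*? stops at the first closing quote.
two_attrs = '<a href="http://example.com/a" title="t">x</a>'
greedy = pattern.search(two_attrs).group(2)
lazy = re.search(r'''(?:href=)('|")(.*?)\1''', two_attrs).group(2)

print(overall)
print(quote, url)
print(url_single)
print(greedy)
print(lazy)
```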

score 0 · Answer 3 · answered Oct 15 '17 at 16:34

Make yourself comfortable with a parser, e.g. with BeautifulSoup.
With this, it could be achieved with

from bs4 import BeautifulSoup

html = """<p><a href="http://www.google.com/home/etc"><b>some text</b></a></p>"""

soup = BeautifulSoup(html, "html5lib")
print(soup.find('a')['href'])
# http://www.google.com/home/etc

BeautifulSoup supports a number of selectors including CSS selectors.
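
If installing a third-party package is not an option, the same parser-based idea can be sketched with the standard library's `html.parser` instead of BeautifulSoup. This is a swapped-in technique rather than the BeautifulSoup API, and the class name `HrefCollector` is a hypothetical helper made up for this example.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


html = '<p><a href="http://www.google.com/home/etc"><b>some text</b></a></p>'

collector = HrefCollector()
collector.feed(html)
print(collector.hrefs)
```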


Jan


That's true, there are so many ways to do it, but it's a regex-only question. It's about working with backreferences, not just solving the problem. :) – DragonKnight Oct 15 '17 at 20:57
