
Let's say we want to extract the link from a tag like this:

input:

<p><a href="http://www.google.com/home/etc"><b>some text</b></a></p>

desired output:

http://www.google.com/home/etc

The first solution is to capture the link with a group, using the regex href=[\'"]?([^\'" >]+), but what I want is to match only the link that follows href=, without matching href itself. Trying (?=href\")... (a lookahead assertion, which matches without consuming) still matches the href itself.

This is a regex-only question.

DragonKnight
  • [Have you tried using an HTML parser instead?](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – jonrsharpe Oct 15 '17 at 10:21
  • If we put aside the issue of parsing HTML with regex, your regex works fine (for your example, at least). But the output depends on the exact function you use. For example, try with `re.findall()`. – randomir Oct 15 '17 at 10:25
  • no I just need the link without `href` – DragonKnight Oct 15 '17 at 10:30
  • I'm really not quite sure what you're asking, but I have a hunch that you're looking for a look*behind*: `(?<=href=['"])[^'" >]+` – Aran-Fey Oct 15 '17 at 10:32
  • this is the exact solution I was looking for. thanks. I have to edit the question. – DragonKnight Oct 15 '17 at 10:34
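
The lookbehind suggested in the comments can be tried out in Python roughly as follows. This is only a sketch against the question's sample input; the variable names are made up for illustration, and the second pattern is a simplified version of the lookahead attempt, included to show why it still matches "href".

```python
import re

s = '<p><a href="http://www.google.com/home/etc"><b>some text</b></a></p>'

# Lookbehind: href=" (or href=') must sit immediately BEFORE the match,
# but it is not part of the matched text.
lookbehind = re.search(r"""(?<=href=['"])[^'" >]+""", s)
print(lookbehind.group(0))

# Lookahead: asserts that href=" starts AT the current position without
# consuming it, so the four characters consumed next are "href" itself.
lookahead = re.search(r'(?=href=")....', s)
print(lookahead.group(0))
```

Note that Python's re module only accepts fixed-width lookbehinds, which is why the quote is a single-character class here rather than an optional one.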

3 Answers


One of many regex based solutions would be a capturing group:

>>> import re
>>> s = '<p><a href="http://www.google.com/home/etc"><b>some text</b></a></p>'
>>> re.search(r'href="([^"]*)"', s).group(1)
'http://www.google.com/home/etc'

[^"]* matches any number of characters other than ".
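
As one of the comments hints, the result depends on which function is used. With re.findall() and a single capturing group, only the group contents are returned, which may be convenient when a document has several links. A small sketch, where the two example.com URLs are invented for illustration:

```python
import re

# Hypothetical input with two links (not from the question).
html = '<a href="http://a.example.com/x">A</a> <a href="http://b.example.com/y">B</a>'

# With exactly one capturing group, findall returns a list of what
# that group captured, so the href= prefix and the quotes are left out.
links = re.findall(r'href="([^"]*)"', html)
print(links)
```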

user2390182

A solution could be:

(?:href=)('|")(.*?)\1

(?:href=) is a non-capturing group. The regex engine uses href= during matching but does not store it in a group. If you try this regex, you will see that no group holds it.

Besides, every pair of plain parentheses creates a capturing group. As a consequence, ('|") defines group #1 and the URL you want will be in group #2. The way you retrieve this info depends on the programming language.

At the end, \1 is a backreference: it matches the value held by group #1 (in this case, ") and so provides the closing delimiter of the URL.
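
In Python, for instance, the groups described above could be read back like this. The sketch uses the lazy quantifier (.*?) so that the match stops at the first closing quote; the tag with a class attribute is a made-up input added to show what a greedy (.*) would capture instead.

```python
import re

lazy = re.compile(r"""(?:href=)('|")(.*?)\1""")
greedy = re.compile(r"""(?:href=)('|")(.*)\1""")

double_quoted = '<a href="http://www.google.com/home/etc">'
single_quoted = "<a href='http://www.google.com/home/etc'>"

# Group 1 holds the opening quote, group 2 the URL; \1 forces the
# closing quote to be the same character as the opening one.
m_double = lazy.search(double_quoted)
m_single = lazy.search(single_quoted)
print(m_double.group(1), m_double.group(2))
print(m_single.group(1), m_single.group(2))

# Hypothetical tag with a second quoted attribute: the greedy version
# runs on to the LAST matching quote, the lazy one stops at the first.
two_attrs = '<a href="http://www.google.com/home/etc" class="nav">'
print(greedy.search(two_attrs).group(2))
print(lazy.search(two_attrs).group(2))
```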

Neb

Make yourself comfortable with a parser, e.g. BeautifulSoup.
With it, this could be achieved with:

from bs4 import BeautifulSoup

html = """<p><a href="http://www.google.com/home/etc"><b>some text</b></a></p>"""

soup = BeautifulSoup(html, "html5lib")
print(soup.find('a')['href'])
# http://www.google.com/home/etc

BeautifulSoup supports a number of selectors including CSS selectors.
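
A CSS selector version might look like the sketch below. It assumes the bs4 package is installed and swaps in Python's built-in "html.parser" backend so that html5lib is not required; the selector string itself is just one possible choice.

```python
from bs4 import BeautifulSoup

html = """<p><a href="http://www.google.com/home/etc"><b>some text</b></a></p>"""

# "html.parser" ships with Python, so no extra parser package is needed.
soup = BeautifulSoup(html, "html.parser")

# Select the first <a> that is a direct child of <p> and has an href attribute.
link = soup.select_one("p > a[href]")
print(link["href"])
```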

Jan
  • That's true, there are many ways to do it, but it's a regex-only question. It's about working with back references, not just solving the problem. :) – DragonKnight Oct 15 '17 at 20:57