Regular Expression to extract all the links and the corresponding link text

Question

I'm brand new to regular expressions, and I am trying to solve the two following problems:

Write a regular expression that extracts all the links and the corresponding link text from an HTML page. For example, if you wanted to parse:
```
 text1 <a href="http://example.com">hello, world</a> text2
```

and get the result

http://example.com <tab> hello, world

Do the same thing, but also handle cases where <...> are nested:

  text1 <a href="http://example.com" onclick="javascript:alert('<b>text2</b>')">hello, world</a> text3

So far I am still on the first question, and I've tried going about this several ways. I think my best answer to the first has been the regex (?<=a href=\")(.*)(?=</a>) which gives me: http://example.com">hello, world

This seems good enough to me, but I don't know how I'm supposed to approach the second part. Any help or insight would be greatly appreciated.

regex are bad with nesting. You should consider a real html parser. — Jean-François Fabre, Dec 15 '16 at 20:04
So how should I answer the question? Just say plz no regex for html parsing? — Zach Ellis, Dec 15 '16 at 20:14
Where are the questions from? Question 2 seems like the exact reason you wouldnt use a regex for this. — chris85, Dec 15 '16 at 20:28
Where did the question come from? If it's from some coursework where you have to have a regex solution, then you can probably hack together some regex that works most of the time for simplistic inputs — Patrick Haugh, Dec 15 '16 at 20:29
It was a quiz a recruitment office sent me and I barely got any experience with regex while getting my CS degree. — Zach Ellis, Dec 15 '16 at 20:35
Here's one I came up with. There are inputs that will break it, but it works for the two you gave. https://regex101.com/r/x2uUUO/3 — Patrick Haugh, Dec 15 '16 at 20:36
I was testing these using regex101.com, so I was using python syntax, but they probably wanted it in perl. And Thanks @PatrickHaugh, (.*) pretty much gave me the results I was looking for. I'm such a noob at this that I don't really know if it matters that the results are split into group 1 and 2. — Zach Ellis, Dec 15 '16 at 20:56
@Jean-FrançoisFabre: *"regex are bad with nesting"* isn't the only (or even the main) problem. Even with regex engines that support recursion, or the .NET regex engine, which can handle nested structures, and even for HTML strings without nested tags, HTML syntax stays unpredictable *(e.g. an attribute value can be enclosed in single quotes, double quotes, or no quotes, or be absent altogether)*. It is very permissive, exists in several versions, and user agents tolerate a lot of malformed markup. — Casimir et Hippolyte, Dec 15 '16 at 21:04

score 1 · Answer 1 · answered Dec 16 '16 at 21:02

With regular expressions, it's often easier to describe what a match must not contain than what it should. This Perl regex captures simple links and their related text:

#!perl

use strict;
use warnings;

my $sample = q{text1 <a href="http://example.com">hello, world</a> text2};

my ($link, $link_text) = $sample =~ m{<a href="([^"]*)"[^>]*>(.*?)</a>};

print "$link \t $link_text\n";

1;

This will print:

http://example.com <tab> hello, world

To break down what it's doing:

The first capture, ([^"]*), looks for 0 or more characters inside the href attribute that are not a double quote. Square brackets define a character class (a set or range of characters), and the leading caret negates it, so the regex matches any character that is not in the class.

Similarly, I use [^>]*> to find the <a> tag's closing bracket without needing to worry about what other attributes may be in the tag. Note that this stops at the first > it sees, even one inside a quoted attribute value.

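Lastly, (.*?) is a non-greedy capture of 0 or more characters (the question mark makes it non-greedy), so it captures the text inside that link only. Without the question mark, the capture would run to the last closing </a> tag it can reach: the last one on the same line, or the last one in the document if the /s modifier is used so that . also matches newlines.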
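For anyone testing in Python (as the asker was on regex101), here is a rough port of the same pattern. It is only a sketch: the names LINK_RE and GREEDY_RE and the two-link sample string are made up for illustration and do not come from the code above. It shows the two capture groups and what changes when the question mark is dropped.

```python
import re

# Python port of the Perl pattern above (names are illustrative).
LINK_RE = re.compile(r'<a href="([^"]*)"[^>]*>(.*?)</a>')
# Same pattern with a greedy capture, for comparison only.
GREEDY_RE = re.compile(r'<a href="([^"]*)"[^>]*>(.*)</a>')

# Hypothetical sample containing two links on one line.
sample = ('text1 <a href="http://example.com">hello, world</a> '
          'text2 <a href="http://example.org">bye</a> text3')

lazy = LINK_RE.findall(sample)      # one (href, text) tuple per link
greedy = GREEDY_RE.findall(sample)  # a single match that swallows both links

for href, text in lazy:
    print(f"{href}\t{text}")
```

With findall, each match comes back as a tuple of the two groups, so it does not matter that the URL and the text land in group 1 and group 2; they can simply be joined with a tab.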

Hopefully this will help you solve part 2 of the assignment. :)
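As a possible starting point for part 2, one tweak is to let the regex step over quoted attribute values as whole units, so that a > inside quotes no longer ends the tag. This is a sketch in Python under my own assumptions (href is the first attribute and is double-quoted; the names SIMPLE_RE and QUOTED_RE are invented here), and there are certainly inputs that will still break it.

```python
import re

# The simple pattern: stops at the first '>' after href, even inside quotes.
SIMPLE_RE = re.compile(r'<a href="([^"]*)"[^>]*>(.*?)</a>')

# Assumed variant: after href, consume any mix of plain characters,
# "double-quoted" values and 'single-quoted' values before the closing '>'.
QUOTED_RE = re.compile(
    r'''<a\s+href="([^"]*)"(?:[^>"']|"[^"]*"|'[^']*')*>(.*?)</a>''',
    re.DOTALL,  # let the link text span several lines
)

plain = 'text1 <a href="http://example.com">hello, world</a> text2'
nested = ('text1 <a href="http://example.com" '
          'onclick="javascript:alert(\'<b>text2</b>\')">hello, world</a> text3')

simple_on_nested = SIMPLE_RE.findall(nested)  # link text comes out mangled
quoted_on_plain = QUOTED_RE.findall(plain)
quoted_on_nested = QUOTED_RE.findall(nested)

for href, text in quoted_on_plain + quoted_on_nested:
    print(f"{href}\t{text}")
```

The three alternatives inside the (?: ... )* group cannot start on the same character, so the group should not cause heavy backtracking on ordinary input.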

score 0 · Answer 2 · answered Dec 15 '16 at 20:32

If you were to solve it with an HTML parser like BeautifulSoup, it would simply come down to locating the a element, then using dictionary-like access for the href attribute and get_text() for the text of the element:

In [1]: from bs4 import BeautifulSoup

In [2]: l = [
   ...:     """text1 <a href="http://example.com">hello, world</a> text2""",
   ...:     """text1 <a href="http://example.com" onclick="javascript:alert('<b>text2</b>')">hello, world</a> text3"""
   ...: ]

In [3]: for s in l:
   ...:     soup = BeautifulSoup(s, "html.parser")
   ...:     link = soup.a
   ...:     print(link["href"] + "\t" + link.get_text())
   ...:
http://example.com  hello, world
http://example.com  hello, world
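If installing a third-party package is not an option, roughly the same idea can be sketched with the standard library's html.parser module, which is the parser selected by the "html.parser" argument above. The LinkExtractor class and the extract_links helper are names I made up for this example, and the sketch assumes links are not nested inside each other. Unlike soup.a, it collects every link in the input rather than only the first.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, text) pairs
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears while an <a href=...> is open.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def extract_links(html):
    """Hypothetical helper: return a list of (href, text) tuples."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


nested = ('text1 <a href="http://example.com" '
          'onclick="javascript:alert(\'<b>text2</b>\')">hello, world</a> text3')
for href, text in extract_links(nested):
    print(href + "\t" + text)
```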
