I am using the BeautifulSoup package to parse an HTML body and search for all <a> tags. What I am trying to do is gather all the links and group them by the <a> target (href). For example: if http://www.google.com is listed twice in the HTML body, then I need to group those links together and list each <a> tag's data-name attribute. (data-name is something added in by my editor for when the user names their link(s).)
import re

from bs4 import BeautifulSoup


def extract_links_from_mailing(mailing):
    content = "%s %s" % (mailing.html_body, mailing.plaintext)
    pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    links = []
    soup = BeautifulSoup(content, "html5lib")
    for link in soup.findAll('a'):
        if not link.get('no_track'):
            target = link.get('href')
            name = link.get('data-name')
            link_text = unicode(link)
            # skip empty, internal, templated, mailto and anchor-only targets
            if any([
                not target,
                'example.net' in target,
                target.startswith('mailto'),
                '{' in target,
                target.startswith('#')
            ]):
                continue
            target = pattern.search(target)
            # found a target and the target isn't already a part of the list
            if target and not any(l['target'] == target.group() for l in links):
                links.append({
                    'name': name,
                    'target': target.group()
                })
    return links
The above output looks like:
[
    {
        "name": "Goog 1",
        "target": "https://www.google.com"
    },
    {
        "name": "Yahoo!",
        "target": "http://www.yahoo.com"
    },
    {
        "name": "Goog 2",
        "target": "https://www.google.com"
    }
]
What I am trying to achieve:
[
    {
        "target": "https://www.google.com",
        "names": ["Goog 1", "Goog 2"]
    },
    {
        "target": "http://www.yahoo.com",
        "names": ["Yahoo!"]
    }
]
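For context, a rough sketch of the kind of grouping step I have in mind is below. group_links_by_target is just a name I made up, collections.OrderedDict is only one option, and this assumes the duplicate-target check in extract_links_from_mailing is removed so that repeated links are kept in the flat list:

from collections import OrderedDict


def group_links_by_target(links):
    # Map each target URL to the list of data-name values seen for it,
    # preserving the order in which targets first appear.
    grouped = OrderedDict()
    for link in links:
        grouped.setdefault(link['target'], []).append(link['name'])
    return [{'target': target, 'names': names}
            for target, names in grouped.items()]

Is there a cleaner way to get from the flat list to that grouped structure?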