I am using the BeautifulSoup package to parse an HTML body and search for all <a> tags. What I am trying to do is gather all the links and group them by the <a> target (href). For example: if http://www.google.com is listed twice in the HTML body, then I need to group those links together and list each <a> tag's data-name attribute. (data-name is something added in by my editor for when the user names their link(s).)
import re

from bs4 import BeautifulSoup


def extract_links_from_mailing(mailing):
    content = "%s %s" % (mailing.html_body, mailing.plaintext)
    pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    links = []
    soup = BeautifulSoup(content, "html5lib")
    for link in soup.findAll('a'):
        if not link.get('no_track'):
            target = link.get('href')
            name = link.get('data-name')
            link_text = unicode(link)
            # skip empty, internal, templated, mailto and anchor-only targets
            if any([
                not target,
                'example.net' in target,
                target.startswith('mailto'),
                '{' in target,
                target.startswith('#')
            ]):
                continue
            target = pattern.search(target)
            # found a target and the target isn't already a part of the list
            if target and not any(l['target'] == target.group() for l in links):
                links.append({
                    'name': name,
                    'target': target.group()
                })
    return links
The above output looks like:
[
    {
        "name": "Goog 1",
        "target": "https://www.google.com"
    },
    {
        "name": "Yahoo!",
        "target": "http://www.yahoo.com"
    },
    {
        "name": "Goog 2",
        "target": "https://www.google.com"
    }
]
What I am trying to achieve:
[
    {
        "target": "https://www.google.com",
        "names": ["Goog 1", "Goog 2"]
    },
    {
        "target": "http://www.yahoo.com",
        "names": ["Yahoo!"]
    }
]
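For context, a rough sketch of the kind of grouping step I have in mind is below. group_links_by_target is just a name I made up, collections.OrderedDict is only one option, and this assumes the duplicate-target check in extract_links_from_mailing is removed so that repeated links are kept in the flat list:

from collections import OrderedDict


def group_links_by_target(links):
    # Map each target URL to the list of data-name values seen for it,
    # preserving the order in which targets first appear.
    grouped = OrderedDict()
    for link in links:
        grouped.setdefault(link['target'], []).append(link['name'])
    return [{'target': target, 'names': names}
            for target, names in grouped.items()]

Is there a cleaner way to get from the flat list to that grouped structure?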