Replacing contents of hrefs not prefixed with http and https

Question

What I want to do is replace href="..." with href="abc/...", except when ... starts with http:// or https://.

I have done the first part, but I could not find a way to detect http:// and https://. Here is the code:

import re

line = '<a href="img/a.html"/>'
print(re.sub(r'href="([^<#][^"]*)"', r'href="abc/\1"', line))
# Correct output: <a href="abc/img/a.html"/>

line = '<a href="http://google.com"/>'
print(re.sub(r'href="([^<#][^"]*)"', r'href="abc/\1"', line))
# Wrong output: <a href="abc/http://google.com"/>

It seems like you're manipulating HTML with regex. A nice read: http://stackoverflow.com/a/1732454/948550 . Can you port the task to something that's meant to parse HTML like: http://www.crummy.com/software/BeautifulSoup/bs4/doc/ ? — Reut Sharabani, Dec 14 '14 at 09:58

Avinash Raj · Answer 1 · 2014-12-14T10:38:59.460

2

Through BeautifulSoup,

>>> import re
>>> from bs4 import BeautifulSoup
>>> s = """<a href="img/a.html"/>
<a href="http://google.com"/>"""
>>> soup = BeautifulSoup(s)
>>> for i in soup.select('a'):
        if re.match(r'(?!https?://)', i['href']):
            i['href'] = 'abc/' + i['href']


>>> print(soup)
<html><body><a href="abc/img/a.html"></a>
<a href="http://google.com"></a></body></html>

OR

No need for regex here.

>>> for i in soup.select('a'):
        if not (i['href'].startswith('http://') or i['href'].startswith('https://')):
            i['href'] = 'abc/' + i['href']


>>> print(soup)
<html><body><a href="abc/img/a.html"></a>
<a href="http://google.com"></a></body></html>

OR

>>> for i in soup.select('a'):
        if not i['href'].startswith(('http://', 'https://')):
            i['href'] = 'abc/' + i['href']


>>> soup
<html><body><a href="abc/img/a.html"></a>
<a href="http://google.com"></a></body></html>

edited Dec 14 '14 at 10:38

answered Dec 14 '14 at 10:15

Avinash Raj

simple `if "https" in i['href']` or `if i['href'].startswith("https")` does away with any need for regex – Padraic Cunningham Dec 14 '14 at 10:21
You can remove some indentation and factor out regex here... `for i in soup('a', href=lambda L: not L.startswith(('http://', 'https://'))):`... – Jon Clements Dec 14 '14 at 10:21
@AvinashRaj if you're going with `.startswith` - note that it takes a tuple as per my comment - saves the `or` and typing out `i['href'].startswith` twice... – Jon Clements Dec 14 '14 at 10:37
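
The difference between the two `startswith` spellings can be checked without BeautifulSoup. A minimal sketch, assuming two hypothetical helpers (`needs_prefix` and `needs_prefix_buggy` are illustrative names, not part of the answer): `not a or b` parses as `(not a) or b`, so an https:// link would still be treated as needing the prefix, whereas the tuple form tests both schemes in one call.

```python
def needs_prefix(href):
    """Return True for hrefs that do not start with http:// or https://.

    Hypothetical helper, used here only for illustration.
    """
    return not href.startswith(('http://', 'https://'))


def needs_prefix_buggy(href):
    # Parsed as (not A) or B, so every https:// link counts as needing a prefix.
    return not href.startswith('http://') or href.startswith('https://')


for href in ('img/a.html', 'http://google.com', 'https://google.com'):
    print(href, needs_prefix(href), needs_prefix_buggy(href))
```

Only the https:// row differs between the two functions, which is why a test document containing just an http:// link does not reveal the problem.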

nu11p01n73R · Answer 2 · 2014-12-14T10:07:58.130

You can use lookarounds:

>>> line='<a href="img/a.html"/>'
>>> re.sub(r'(?<=href=")(?!https?)',r'abc/', line)
'<a href="abc/img/a.html"/>'

>>> line='<a href="http://google.com"/>'
>>> re.sub(r'(?<=href=")(?!https?)',r'abc/', line)
'<a href="http://google.com"/>'

(?<=href=") is a positive lookbehind. It checks that the current position is preceded by href="
(?!https?) is a negative lookahead. It checks that the position right after href=" is not followed by http or https
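
A minimal sketch of the same lookaround idea, tightened to require `://` after the scheme, so that a relative link which merely begins with the letters "http" (for example `httpdocs/a.html`) still gets the prefix. The helper name `add_prefix` and its `prefix` parameter are assumptions made for illustration; they are not part of the answer above.

```python
import re

# Zero-width match right after href=" unless an http:// or https:// URL follows.
HREF_START = re.compile(r'(?<=href=")(?!https?://)')


def add_prefix(html, prefix="abc/"):
    """Insert prefix at the start of every non-http(s) href value.

    Hypothetical helper. A function is used as the replacement so that
    backslashes in prefix are never interpreted as regex escapes.
    """
    return HREF_START.sub(lambda match: prefix, html)


print(add_prefix('<a href="img/a.html"/>'))
print(add_prefix('<a href="http://google.com"/>'))
print(add_prefix('<a href="httpdocs/a.html"/>'))
```

Like any regex over HTML, this only handles double-quoted, lowercase `href="` attributes; single quotes or extra whitespace around `=` would need a parser or a looser pattern.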

score -1 · Answer 3 · edited May 23 '17 at 12:12

-1

This is for those who can port their task to HTML-parsing libraries (like BeautifulSoup)

import bs4

# BeautifulSoup may wrap the fragment in extra tags to form a valid document;
# we ignore them since we only need the <a> tag
element = bs4.BeautifulSoup('<a href="img/a.html"/>')
print(element)

element.a['href'] = 'abc/' + element.a['href']
# the link has changed - print the element's tag
print(element.a)

# to get the string, simply convert with str()
print(str(element.a))
# prints: <a href="abc/img/a.html"></a>

Bonus read on parsing HTML with regex.

answered Dec 14 '14 at 10:05

Reut Sharabani

2

while beautifulsoup is a much better option to parse html, your code does not answer the question – Padraic Cunningham Dec 14 '14 at 10:10
