How to remove any URL within a string in Python

Question

I want to remove all URLs inside a string (replace them with ""). I searched around but couldn't really find what I want.

Example:

text1
text2
http://url.com/bla1/blah1/
text3
text4
http://url.com/bla2/blah2/
text5
text6
http://url.com/bla3/blah3/

I want the result to be:

text1
text2
text3
text4
text5
text6

Are you sure you've researched sufficiently? Have you tried **regular expressions**? — Abhranil Das, Jul 04 '12 at 15:32
Yes but I didn't really understand how to do it in my example.. — Taha, Jul 04 '12 at 15:34
Have you looked at http://stackoverflow.com/questions/520031/whats-the-cleanest-way-to-extract-urls-from-a-string-using-python — Matthew Adams, Jul 04 '12 at 15:41

score 113 · Answer 1 · answered Nov 26 '16 at 21:01

The shortest way:

import re
re.sub(r'http\S+', '', stringliteral)

tolgayilmaz

This will also remove 'httpabc' and 'abchttp'. – Louis Yang Oct 06 '18 at 00:52
@LouisYang huh? it shouldn't (and doesn't; at least on 3.7) remove abchttp. You'd have to use `.*http` or something like that. BTW, I'd suggest `r'https?://\S+'`. – Igor Hatarist Feb 16 '19 at 00:53
this is the best solution and should be marked as the right answer – Henley n Jun 23 '20 at 21:29
You can also write it like `text = re.sub(r"\S*https?:\S*", "", text)` to remove the links even if they're inside parentheses or brackets. – mitra mirshafiee Mar 12 '21 at 11:44
@henley, above code did not work for my text: '''$0.29 non-gaap diluted income per share. [$29 million after tax] on the revaluation, http : //www.businesswire.com/news/home/20210217005928/en .-AAAAA-santa clara,""" – tursunWali Mar 20 '21 at 02:53
@tursunWali i think it's because of the space in your link but i'm not sure? – Henley n Mar 21 '21 at 05:43
The above solution will also remove 'httpabc' etc, to overcome that, you can use something like `re.sub(r'http\S*:{1}/{2}\S+', '', sentence)` – MasterBlasterCoder Nov 14 '22 at 08:16
same here. didn't work on several tweets when used in a function. ```re.sub(r'^https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)``` worked – Simone Mar 29 '23 at 09:35
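
The disagreement in the comments above is easy to check directly. The following is a minimal sketch, not part of the original answer; the sample string and the function names are made up for illustration. It compares the answer's pattern with the stricter `https?://\S+` suggested by Igor Hatarist.

```python
import re

def remove_loose(text):
    # Pattern from the answer: "http" followed by one or more non-whitespace characters.
    return re.sub(r'http\S+', '', text)

def remove_strict(text):
    # Variant from the comments: the scheme must be followed by "://".
    return re.sub(r'https?://\S+', '', text)

# Hypothetical sample containing two non-URL words and one real URL.
sample = "see httpabc and abchttp then visit https://example.com/a?b=1 today"

print(repr(remove_loose(sample)))   # 'httpabc' is removed along with the URL
print(repr(remove_strict(sample)))  # only the URL is removed
```

With this sample, 'abchttp' survives both patterns because no non-whitespace character follows 'http'; only the loose pattern removes 'httpabc'.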

Ωmega · Accepted Answer · 2012-07-04T16:21:37.277

94

Python script:

import re
text = re.sub(r'^https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)

Output:

text1
text2
text3
text4
text5
text6

edited Jul 04 '12 at 16:21

answered Jul 04 '12 at 16:15

This solution assumes that any URL is immediately followed by a newline (which is the case in the OP's example, but just FYI). tolgayilmaz's [regular expression](https://stackoverflow.com/a/40823105/395857) doesn't have this potential shortcoming. – Franck Dernoncourt Apr 10 '18 at 21:07
@FranckDernoncourt Interesting because this was not the case for the twitter dataset I am working with. Above code removed all urls despite them not being immediately followed by a new line – Simone Mar 29 '23 at 09:37
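
Whether this pattern removes a given URL depends on where the URL sits in its line. A small sketch, using made-up sample strings and a hypothetical helper name, shows the three cases: the `^` anchor (with `re.MULTILINE`) only matches at the start of a line, and `.*` then consumes the rest of that line.

```python
import re

# Pattern from the accepted answer.
PATTERN = r'^https?:\/\/.*[\r\n]*'

def strip_line_start_urls(text):
    return re.sub(PATTERN, '', text, flags=re.MULTILINE)

own_line = "text1\nhttp://url.com/bla1/\ntext2"         # URL alone on a line
mid_line = "see http://url.com/bla1/ for details"       # URL in the middle of a line
line_start = "http://url.com/bla1/ trailing words\ntext2"  # URL first, then other words

print(repr(strip_line_start_urls(own_line)))    # the URL line disappears
print(repr(strip_line_start_urls(mid_line)))    # unchanged: the URL is not at a line start
print(repr(strip_line_start_urls(line_start)))  # the trailing words are removed too
```

If URLs can appear mid-sentence, a pattern without the `^` anchor, such as `https?://\S+`, is probably the safer choice.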

score 30 · Answer 3 · answered Jul 04 '12 at 16:12

This worked for me:

import re
thestring = "text1\ntext2\nhttp://url.com/bla1/blah1/\ntext3\ntext4\nhttp://url.com/bla2/blah2/\ntext5\ntext6"

URLless_string = re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', thestring)
print(URLless_string)

Result:

text1
text2

text3
text4

text5
text6
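
The substitution above leaves an empty line where each URL used to be. If those are unwanted, one option (an assumption about what the reader wants, not something the answer states) is to filter out blank lines afterwards. The helper name below is hypothetical, and note that this also drops lines that were already blank in the input.

```python
import re

# Pattern from the answer above.
URL_PATTERN = r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*'

def remove_urls_and_blank_lines(text):
    without_urls = re.sub(URL_PATTERN, '', text)
    # Keep only lines that still contain something other than whitespace.
    return "\n".join(line for line in without_urls.splitlines() if line.strip())

thestring = "text1\ntext2\nhttp://url.com/bla1/blah1/\ntext3\ntext4\nhttp://url.com/bla2/blah2/\ntext5\ntext6"
print(remove_urls_and_blank_lines(thestring))
```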

score 19 · Answer 4 · answered Apr 26 '18 at 06:48

Removal of HTTP links/URLs mixed up in any text:

import re
re.sub(r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))''', " ", text)

Pranzell

This method hangs for me when parsing a string with '[]()'. Any idea why? – Vid Stropnik Jun 24 '20 at 15:31
above method worked for me. I think this is most comprehensive solution. – tursunWali Mar 20 '21 at 03:01
just a stern warning, i attempted to use this exact regex to remove URLS from a ve...eeery long text in ONE , just ONE record in swifter + pandas dataframe. and after waiting for hours it didn't seem to end... then I used this one https://stackoverflow.com/a/38498442/1465073, and the same text took like a fraction of a second to finish. I lost like 12-24 hours worth of work just trying to figure out what was happening. No errors, no warnings, just my apply function seemingly frozen for hours. Im using 13700K, 2 x 16 DDR5 6000, RTX 4090. The same issue manifested in Azure cloud A100 and V100s – user1465073 Mar 07 '23 at 17:32
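
The hangs reported above are consistent with catastrophic backtracking, which nested quantifiers like the ones in this pattern are prone to. When the input is long or untrusted, a cruder token-based filter avoids the problem. The sketch below is an alternative technique, not a rewrite of the answer's regex: the prefix list and function name are assumptions, and it deliberately misses URLs glued to other characters, such as `(https://example.com)`.

```python
import re

# Assumed prefixes that mark a token as a URL; adjust for your data.
URL_PREFIXES = ("http://", "https://", "www.")

def remove_url_tokens(text):
    # Split on whitespace but keep the separators so the original spacing survives.
    parts = re.split(r'(\s+)', text)
    kept = [p for p in parts if not p.lower().startswith(URL_PREFIXES)]
    return "".join(kept)

print(repr(remove_url_tokens("a https://x.com/[]() b\nwww.y.org c")))
```

Each token is inspected once, so the running time grows linearly with the length of the text.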

score 17 · Answer 5 · answered Jul 21 '16 at 08:05

This solution handles http, https, and the special characters commonly found in URLs:

import re

def remove_urls(vTEXT):
    # Remove an optional http/https scheme, '://', and the run of allowed characters that follows.
    vTEXT = re.sub(r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b', '', vTEXT, flags=re.MULTILINE)
    return vTEXT


print( remove_urls("this is a test https://sdfs.sdfsdf.com/sdfsdf/sdfsdf/sd/sdfsdfs?bob=%20tree&jef=man lets see this too https://sdfsdf.fdf.com/sdf/f end"))

Lee Martin

It doesn't work if the URL contains a hyphen, e.g. `print(remove_urls("this https://sdfs-sdfsdf.com yo"))` -> `this -sdfsdf.com yo` – Franck Dernoncourt Apr 10 '18 at 20:54
Use this instead: `r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%|\-)*\b'` – Yash Gupta Jan 21 '21 at 18:03
I like the idea of this, but why is `(http|https)` optional? Do any URLs begin with `://`? I have had decent success with `(https|http|ftp):\/\/\S+` – oelna Jun 18 '21 at 15:45
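
The hyphen problem and the optional-scheme question raised in the comments can both be reproduced in a few lines. This is a sketch with hypothetical constant names; the two patterns are copied from the answer and from Yash Gupta's comment. The `re.MULTILINE` flag is left out because neither pattern uses `^` or `$`, so it has no effect.

```python
import re

ORIGINAL = r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b'
WITH_HYPHEN = r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%|\-)*\b'

def remove_urls(text, pattern):
    return re.sub(pattern, '', text)

sample = "this https://sdfs-sdfsdf.com yo"
print(repr(remove_urls(sample, ORIGINAL)))     # stops at the hyphen
print(repr(remove_urls(sample, WITH_HYPHEN)))  # removes the whole URL

# Because the scheme group is optional, a bare "://" prefix is matched as well.
print(repr(remove_urls("x ://y z", ORIGINAL)))
```

Making the scheme mandatory, as oelna suggests with `(https|http|ftp):\/\/\S+`, avoids that last case.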

Gabriel Giraldo-Wingler · Answer 6 · 2018-09-10T01:51:53.467

15

I couldn't find an answer that handled my particular situation: removing URLs from the middle of tweets, where the URLs themselves can contain whitespace. So I wrote my own pattern:

(https?:\/\/)(\s)*(www\.)?(\s)*((\w|\s)+\.)*([\w\-\s]+\/)*([\w\-]+)((\?)?[\w\s]*=\s*[\w\%&]*)*

Here's an explanation:
(https?:\/\/) matches http:// or https://
(\s)* optionally matches whitespace
(www\.)? optionally matches www.
(\s)* optionally matches whitespace
((\w|\s)+\.)* matches zero or more runs of word or whitespace characters, each followed by a period
([\w\-\s]+\/)* matches zero or more runs of word characters, dashes, or spaces, each followed by '/'
([\w\-]+) matches the last remaining segment (word characters or dashes) before any query string
((\?)?[\w\s]*=\s*[\w\%&]*)* matches trailing query parameters (even with whitespace, etc.)

Test this out here: https://regex101.com/r/NmVGOo/8

edited Sep 10 '18 at 01:51

answered Aug 16 '18 at 20:20

Please edit your answer to include the explanation. Links can go dead. – mypetlion Aug 16 '18 at 20:39
@Gabriel, I have modified your code a little bit so that it works for both http and https: (?:(https|http)\s?:\/\/)(\s)*(www\.)?(\s)*((\w|\s)+\.)*([\w\-\s]+\/)*([\w\-]+)((\?)?[\w\s]*=\s*[\w\%&]*)* – tursunWali Mar 20 '21 at 03:22
@tursunWali it already works for http and https, please see the attached testing link. Thank you – Gabriel Giraldo-Wingler Apr 09 '21 at 21:55
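
The group-by-group explanation in this answer maps naturally onto Python's `re.VERBOSE` flag, which lets the comments live next to each group. This is a sketch of the same pattern, unchanged apart from layout; the constant name, helper name, and sample strings are assumptions for illustration.

```python
import re

TWEET_URL = re.compile(r"""
    (https?:\/\/)                  # http:// or https://
    (\s)*                          # optional whitespace
    (www\.)?                       # optional www.
    (\s)*                          # optional whitespace
    ((\w|\s)+\.)*                  # zero or more dot-terminated runs of word/space characters
    ([\w\-\s]+\/)*                 # zero or more slash-terminated path segments
    ([\w\-]+)                      # last remaining segment
    ((\?)?[\w\s]*=\s*[\w\%&]*)*    # optional query parameters
""", re.VERBOSE)

def remove_tweet_urls(text):
    return TWEET_URL.sub("", text)

print(repr(remove_tweet_urls("go to https://www.example.com/news/item now")))
print(repr(remove_tweet_urls("see https:// example.com/item end")))
```

Because the pattern tolerates whitespace inside URLs, it can also swallow ordinary words that follow a URL in some inputs, so it is worth testing on real data first.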

Samuel Nde · Answer 7 · 2019-03-29T19:06:56.857

13

What you really want is to remove any substring that starts with http:// or https:// followed by any run of non-whitespace characters. Here is how I would solve it. My solution is very similar to that of @tolgayilmaz.

# Define the text from which you want to remove the URL.
text = '''The link to this post is https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python'''

import re
# Either use:
re.sub(r'http://\S+|https://\S+', '', text)
# OR
re.sub(r'http[s]?://\S+', '', text)

And the result of running either code above is

>>> 'The link to this post is '

I prefer the second one because it is more readable.

edited Mar 29 '19 at 19:06

answered Jan 15 '19 at 20:42

What if there are whitespace characters, as in: http : //www.businesswire.com/news/home/20210217005928/en . – tursunWali Mar 20 '21 at 03:02
Hmmm! will that still be a url? – Samuel Nde Mar 20 '21 at 21:40
if we want to remove such word groups from file, then what to do? (if I modify my question) – tursunWali Mar 21 '21 at 04:30
I think that would be a new and different question. Not the one being asked here. My answer is for the question that was posted here. – Samuel Nde Mar 21 '21 at 20:53

mounirboulwafa · Answer 8 · 2021-01-06T19:46:45.153

8

To remove any URL within a string in Python, you can use this regex-based function:

import re

def remove_URL(text):
    """Remove URLs from a text string"""
    return re.sub(r"http\S+", "", text)

edited Jan 06 '21 at 19:46

answered Aug 28 '20 at 11:55

this should be the best answer! – rizkidzulkarnain Oct 06 '21 at 01:25

Nischit Pradhan · Answer 9 · 2018-03-13T14:36:25.920

7

I know this has already been answered and it's very late, but I think this should be here. This is a regex that matches any space-delimited token with a dot in it, which covers most kinds of URL.

[^ ]+\.[^ ]+

It can be used like this:

re.sub(r'[^ ]+\.[^ ]+', '', sentence)

edited Mar 13 '18 at 14:36

answered Mar 13 '18 at 13:39

This is only a regex, this does not replace anything and thus this isn't answering the question. – André Kool Mar 13 '18 at 13:58
@AndréKool this is for matching any kind of url. for replacing there are already a lot of answers above – Nischit Pradhan Mar 13 '18 at 14:30
In that case i suggest you [edit](https://stackoverflow.com/posts/49257661/edit) your answer to explain that to avoid any confusion. – André Kool Mar 13 '18 at 14:32
This worked for me! Thanks! It is a very elegant and efficient solution to match URLs starting with or without http(s) and www. – tmsss Jul 26 '18 at 16:21
indeed , very elegant and simple solution, covers ALL cases, thank you man – tursunWali Aug 09 '21 at 04:42
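
Before relying on this pattern, it helps to see what else it matches. The sketch below (hypothetical function name, made-up sample strings) shows two side effects: any token containing an inner dot is removed, and because `[^ ]` also matches newlines, a URL on its own line takes its neighbouring lines with it.

```python
import re

def remove_dotted_tokens(sentence):
    # Pattern from the answer: non-space run, a dot, another non-space run.
    return re.sub(r'[^ ]+\.[^ ]+', '', sentence)

# Decimal numbers and bare domains are removed along with the full URL.
print(repr(remove_dotted_tokens("pi is 3.14 see example.com or https://example.com/x now")))

# Newlines count as "non-space", so text2 and text3 vanish together with the URL.
print(repr(remove_dotted_tokens("text1 text2\nhttp://url.com/a\ntext3 text4")))
```

Replacing `[^ ]` with `\S` would at least stop the match at newlines, though numbers like 3.14 would still be removed.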

score 6 · Answer 10 · answered Jul 04 '12 at 16:48

You could also look at it from the other way around...

from urllib.parse import urlparse  # Python 3; in Python 2 this was `from urlparse import urlparse`
[el for el in ['text1', 'FTP://somewhere.com', 'text2', 'http://blah.com:8080/foo/bar#header'] if not urlparse(el).scheme]
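
The same idea can be applied to the question's multi-line text. This sketch assumes Python 3, where `urlparse` lives in `urllib.parse`, and assumes each URL sits on its own line; the function name and the set of accepted schemes are illustrative choices. Checking against an explicit set is stricter than a bare `.scheme` test, which would also treat a line like `note: keep me` as having a scheme.

```python
from urllib.parse import urlparse

URL_SCHEMES = {"http", "https", "ftp"}  # assumed set; extend as needed

def drop_url_lines(text):
    kept = []
    for line in text.splitlines():
        # urlparse lowercases the scheme, so "FTP://..." is caught too.
        if urlparse(line.strip()).scheme not in URL_SCHEMES:
            kept.append(line)
    return "\n".join(kept)

sample = "text1\ntext2\nhttp://url.com/bla1/blah1/\ntext3\nFTP://somewhere.com\nnote: keep me"
print(drop_url_lines(sample))
```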

Jon Clements

Shailesh Wadhwa · Answer 11 · 2017-09-02T14:42:12.967

3

The following regular expression in Python works well for detecting URL(s) in the text:

source_text = '''
text1
text2
http://url.com/bla1/blah1/
text3
text4
http://url.com/bla2/blah2/
text5
text6    '''

import re
url_reg  = r'[a-z]*[:.]+\S+'
result   = re.sub(url_reg, '', source_text)
print(result)

Output:

text1
text2

text3
text4

text5
text6

edited Sep 02 '17 at 14:42

answered Sep 02 '17 at 14:19

The question was answered 5 years ago. What new value does your answer bring? – Maciej Jureczko Sep 02 '17 at 14:23
This will delete lines like `text1:text2`, that is not wanted. – Toto Sep 02 '17 at 14:45
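
Toto's objection can be reproduced with a short sketch (hypothetical function name, made-up sample string). The pattern fires on any colon or dot that is followed by non-whitespace, whether or not a URL is involved.

```python
import re

def remove_with_loose_pattern(text):
    # Pattern from the answer: optional lowercase letters, then ':' or '.', then non-whitespace.
    return re.sub(r'[a-z]*[:.]+\S+', '', text)

# "note:todo" is removed entirely and "3:4" is truncated to "3", along with the real URL.
print(repr(remove_with_loose_pattern("note:todo ratio 3:4 http://url.com/x")))
```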

score 1 · Answer 12 · edited Jul 12 '22 at 08:17

Why not use this? It is quite complete:

i = re.sub(r"(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)","",i)

edited Jul 12 '22 at 08:17

FObersteiner

answered Jul 12 '22 at 04:41

Ilya

score 0 · Answer 13 · answered Nov 05 '19 at 06:07

import re
s = '''
text1
text2
http://url.com/bla1/blah1/
text3
text4
http://url.com/bla2/blah2/
text5
text6
http://url.com/bla3/blah3/'''
g = re.findall(r'(text\d+)',s)
print ('list',g)
for i in g:
    print (i)

Out

list ['text1', 'text2', 'text3', 'text4', 'text5', 'text6']
text1
text2
text3
text4
text5
text6

The text is just an example, not a keyword. It can be any sentence or word. – Fatemeh Rahimi Jan 02 '21 at 01:56

score 0 · Answer 14 · answered Aug 11 '21 at 09:21

I think the most general URL regex pattern is this one:

URL_PATTERN = r'[A-Za-z0-9]+://[A-Za-z0-9%-_]+(/[A-Za-z0-9%-_])*(#|\\?)[A-Za-z0-9%-_&=]*'

There is also a small module that does what you want:

pip install mysmallutils

from mysutils.text import remove_urls

remove_urls(text)

TBhavnani · Answer 15 · 2021-09-26T10:23:32.310

0

A lazy .*? with a positive lookahead should do the job.

import re

text="text1\ntext2\nhttp://url.com/bla1/blah1/\ntext3\ntext4\nhttp://url.com/bla2/blah2/\ntext5\ntext6"

req=re.sub(r'http.*?(?=\s)', " ", text)
print(req)
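
One caveat: the lookahead `(?=\s)` requires whitespace after the URL, so a URL at the very end of the string (as in the question's own example) is left in place. A possible adjustment, sketched here with hypothetical function names, is to accept the end of the string as well.

```python
import re

def remove_with_lookahead(text):
    # Pattern from the answer: the URL must be followed by whitespace.
    return re.sub(r'http.*?(?=\s)', ' ', text)

def remove_with_end_anchor(text):
    # Assumed variant: also accept the end of the string after the URL.
    return re.sub(r'http.*?(?=\s|$)', ' ', text)

print(repr(remove_with_lookahead("a http://u.com b")))  # URL followed by a space: removed
print(repr(remove_with_lookahead("a http://u.com")))    # URL at the end: left untouched
print(repr(remove_with_end_anchor("a http://u.com")))   # URL at the end: removed
```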

edited Sep 26 '21 at 10:23

answered Sep 20 '21 at 07:07
