Expression to remove URL links from Twitter tweet

Question

I would simply like to find and remove all occurrences of a Twitter URL in a string (tweet):

Input:

This is a tweet with a url: http://t.co/0DlGChTBIx

Output:

This is a tweet with a url:

I've tried this:

p=re.compile(r'\<http.+?\>', re.DOTALL)
tweet_clean = re.sub(p, '', tweet)

Might be helpful http://stackoverflow.com/questions/520031/whats-the-cleanest-way-to-extract-urls-from-a-string-using-python — Kemal Fadillah, Jun 25 '14 at 03:48
For this specific case you can do: `your_string.replace('http://t.co/0DlGChTBIx','')` — Marcin, Jun 25 '14 at 03:49
I've tried a bunch of different regex expressions not working... — hagope, Jun 25 '14 at 03:52
http://geekswithblogs.net/casualjim/archive/2005/12/01/61722.aspx for url regex — karthikr, Jun 25 '14 at 03:54

score 70 · Accepted Answer · answered Jun 25 '14 at 03:51

Do this:

result = re.sub(r"http\S+", "", subject)

http matches the literal characters "http"
\S+ matches one or more non-whitespace characters (the rest of the URL)
the match is replaced with the empty string
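To put the one-liner in context, here is a minimal, self-contained sketch. `re` is Python's standard regular-expression module; `remove_urls` and the sample tweet are illustrative names, not part of the answer above, and the trailing `.strip()` is an added assumption to drop the space left behind.

```python
import re

def remove_urls(tweet):
    """Remove every run of non-whitespace characters that starts with 'http'."""
    # Hypothetical helper: same pattern as above, plus strip() (an added
    # assumption) to drop the space left behind when the URL was at the end.
    return re.sub(r"http\S+", "", tweet).strip()

tweet = "This is a tweet with a url: http://t.co/0DlGChTBIx"
print(remove_urls(tweet))  # This is a tweet with a url:
```

`re.sub` returns a new string, so the original `tweet` is left unchanged.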

zx81

It will also remove any word that starts with http, e.g. httparty – Kaustuv Dec 28 '19 at 13:38
What is `re`, `subject` and is `result` a constant? – Dimitri Kopriwa Dec 21 '20 at 10:12

score 2 · Answer 2 · answered Jun 25 '14 at 03:59

The following regex captures two groups: the first contains everything in the tweet up to the URL, and the second contains everything after the URL (empty in the example you posted above):

import re

tweet = 'This is a tweet with a url: http://t.co/0DlGChTBIx'
# group(1): text before the URL; group(2): text after the URL
clean_tweet = re.match(r'(.*?)http\S*\s?(.*)', tweet)
if clean_tweet:
    print(clean_tweet.group(1))
    print(clean_tweet.group(2))  # prints everything after the URL
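
The same before/after split can also be sketched with `re.search` and the match's `start()`/`end()` positions, which avoids relying on capture groups. `split_around_url` is a hypothetical helper name, and the sample sentence is made up for illustration.

```python
import re

def split_around_url(tweet):
    """Return (text before the first URL, text after it), or None if no URL."""
    match = re.search(r"http\S+", tweet)
    if match is None:
        return None
    # Slice the original string on either side of the matched URL.
    return tweet[:match.start()], tweet[match.end():]

print(split_around_url("Look at http://t.co/0DlGChTBIx right now"))
# ('Look at ', ' right now')
```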

Nir Alfasi

You asked for the output to be: "This is a tweet with a url:" and that's what it does. Do you want to extract only the URL? – Nir Alfasi Aug 21 '14 at 22:17
string = "niki minaj - anaconda:hhttp://youtu.be/LDZX4ooRsWs @tv1 #niki this is cool!" ( trying to take out the url, the (at)mention and the #hashtag and leave everything else ) – sirvon Aug 21 '14 at 22:22
@sirvon can you add this test-case with an expected output to the question? – Nir Alfasi Aug 22 '14 at 00:47

score 2 · Answer 3 · answered Jul 04 '20 at 13:40

You can use:

import re

text = 'Amazing save #FACup #zeebox https://stackoverflow.com/tiUya56M Ok'
text = re.sub(r'https?:\/\/\S*', '', text, flags=re.MULTILINE)

# output: 'Amazing save #FACup #zeebox  Ok'

r: Python's raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'.
?: causes the resulting RE to match 0 or 1 repetitions of the preceding RE, so https? matches either 'http' or 'https'.
https?:\/\/ matches any "http://" or "https://" in the string.
\S: matches any single character that is not whitespace.
*: zero or more occurrences of the preceding RE.
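
The difference the scheme prefix makes can be seen in a short sketch. The names `LOOSE` and `STRICT` and the sample text are illustrative assumptions; only the two patterns come from this page.

```python
import re

LOOSE = re.compile(r"http\S+")         # any run of characters starting with "http"
STRICT = re.compile(r"https?://\S*")   # only "http://" or "https://" URLs

text = "Try httparty for requests https://stackoverflow.com/tiUya56M Ok"

print(LOOSE.sub("", text))   # also deletes the word "httparty"
print(STRICT.sub("", text))  # keeps "httparty", removes only the URL
```

Requiring `://` after the scheme is what keeps ordinary words that merely begin with "http" from being removed.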

score 1 · Answer 4 · answered Jun 25 '14 at 04:35

You could try the re.sub call below to remove the URL from your string:

>>> import re
>>> tweet = 'This is a tweet with a url: http://t.co/0DlGChTBIx'
>>> m = re.sub(r':.*$', ":", tweet)
>>> m
'This is a tweet with a url:'

It removes everything after the first : and the : in the replacement string puts the colon back at the end.

This one returns all the characters up to and including the first :

>>> m = re.search(r'^.*?:', tweet).group()
>>> m
'This is a tweet with a url:'

score 1 · Answer 5 · answered Feb 17 '20 at 17:36

text = re.sub(r"https:(\/\/t\.co\/([A-Za-z0-9]|[A-Za-z]){10})", "", text)

This matches https://t.co/ followed by exactly 10 alphanumeric characters.
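
If only Twitter's own short links should go, a slightly more permissive variant can be sketched as follows. Accepting both `http` and `https` and allowing any number of alphanumeric characters are assumptions made here (the URL in the question uses `http://`); `TCO_LINK` and `remove_tco_links` are hypothetical names.

```python
import re

# Assumption: t.co links use http or https and an alphanumeric path of any length.
TCO_LINK = re.compile(r"https?://t\.co/[A-Za-z0-9]+")

def remove_tco_links(text):
    """Remove t.co short links and leave every other URL untouched."""
    return TCO_LINK.sub("", text)

text = "Docs at https://docs.python.org/3/ via http://t.co/0DlGChTBIx"
print(remove_tco_links(text))
```

The escaped dot in `t\.co` matters: an unescaped `.` would match any character there.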

self.Fool

score 0 · Answer 6 · edited Jun 14 '18 at 10:11

Try using this:

text = re.sub(r"http\S+", "", text)

alexander.polomodov

answered Jun 14 '18 at 09:43

Garima Rawat

score 0 · Answer 7 · answered Jun 17 '19 at 13:33

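import re

pattern = r'(.*?)http(.*?)\s(.*)'
clean_tweet = re.match(pattern, content)  # content holds the tweet text

while clean_tweet:
    # Keep the text around the URL, then look for the next one. A URL at the
    # very end of the string has no whitespace after it and is left in place.
    content = clean_tweet.group(1) + " " + clean_tweet.group(3)
    clean_tweet = re.match(pattern, content)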

nancy agarwal

score 0 · Answer 8 · answered May 31 '21 at 03:27

I found this solution:

text = re.sub(r'https?://\S+|www\.\S+', '', text)
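
A quick sketch of what the alternation buys: the second branch also catches links written without a scheme. The sample text, the names `URL`, `without_urls` and `tidy`, and the whitespace clean-up step are assumptions added for illustration.

```python
import re

URL = re.compile(r"https?://\S+|www\.\S+")

text = "See www.example.com and https://example.org/page for details"
without_urls = URL.sub("", text)

# Removing URLs leaves doubled spaces; collapse them (an added clean-up step).
tidy = " ".join(without_urls.split())
print(tidy)  # See and for details
```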

Julia Stanina