31

I would like to find and remove all occurrences of a Twitter URL in a string (tweet):

Input:

This is a tweet with a url: http://t.co/0DlGChTBIx

Output:

This is a tweet with a url:

I've tried this:

p=re.compile(r'\<http.+?\>', re.DOTALL)
tweet_clean = re.sub(p, '', tweet)
Ria
hagope

8 Answers

70

Do this:

result = re.sub(r"http\S+", "", subject)
  • http matches literal characters
  • \S+ matches all non-whitespace characters (the end of the url)
  • we replace with the empty string
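
As a quick check, here is a minimal sketch that applies this pattern to the tweet from the question. The helper name `strip_urls` is hypothetical, and the final `.strip()` is an added assumption: the substitution leaves behind the space that preceded the URL, which the expected output in the question does not show.

```python
import re

def strip_urls(tweet):
    # Hypothetical helper: remove every run of non-whitespace that starts with "http".
    cleaned = re.sub(r"http\S+", "", tweet)
    # Assumption: leading/trailing whitespace left by the removal is unwanted.
    return cleaned.strip()

tweet = "This is a tweet with a url: http://t.co/0DlGChTBIx"
print(strip_urls(tweet))  # This is a tweet with a url:
```

Without the `.strip()`, the result would end in a single trailing space.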
zx81
2

The following regex captures two groups: the first contains everything in the tweet before the URL, and the second contains everything after the URL (empty in the example you posted):

import re

text = 'This is a tweet with a url: http://t.co/0DlGChTBIx'
# \S* consumes the URL itself; the greedy (.*) captures whatever follows it
clean_tweet = re.match(r'(.*?)http\S*\s?(.*)', text)
if clean_tweet:
    print(clean_tweet.group(1))  # everything before the URL
    print(clean_tweet.group(2))  # everything after the URL
Nir Alfasi
  • 1
    You asked the output to be: "This is a tweet with a url:" and that's what it does. Do you want to extract only the URL ? – Nir Alfasi Aug 21 '14 at 22:17
  • 1
    string = "niki minaj - anaconda:hhttp://youtu.be/LDZX4ooRsWs @tv1 #niki this is cool!" ( trying to take out the url, the (at)mention and the #hashtag and leave everything else ) – sirvon Aug 21 '14 at 22:22
  • @sirvon can you add this test-case with an expected output to the question? – Nir Alfasi Aug 22 '14 at 00:47
2

You can use:

import re

text = 'Amazing save #FACup #zeebox https://stackoverflow.com/tiUya56M Ok'
text = re.sub(r'https?:\/\/\S*', '', text, flags=re.MULTILINE)

# output: 'Amazing save #FACup #zeebox  Ok'
  • The r prefix uses Python's raw string notation for the pattern; backslashes are not handled in any special way in a string literal prefixed with 'r'.
  • ? makes the preceding element match 0 or 1 times, so https? matches either "http" or "https".
  • https?:\/\/ matches "http://" or "https://" anywhere in the string.
  • \S matches any single character that is not whitespace.
  • * means zero or more occurrences, so \S* consumes the rest of the URL up to the next whitespace.
  • re.MULTILINE only changes how ^ and $ behave; this pattern uses neither, so the flag can be omitted.
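
To illustrate why requiring the `://` part can matter, here is a small sketch comparing this pattern with the looser `http\S+`. The sample sentence is made up for the comparison.

```python
import re

loose = re.compile(r"http\S+")        # any run of non-whitespace starting with "http"
strict = re.compile(r"https?://\S*")  # only runs starting with http:// or https://

# Made-up sample: "httpbin" is an ordinary word, not a URL.
sample = "see httpbin docs at https://example.com/page now"

loose_result = loose.sub("", sample)
strict_result = strict.sub("", sample)

print(loose_result)   # see  docs at  now
print(strict_result)  # see httpbin docs at  now
```

The looser pattern also removes the word "httpbin", while the stricter one leaves it alone. Both leave a double space where the URL used to be.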
Mohammad Nazari
1

You could try the re.sub call below to remove the URL from your string:

>>> str = 'This is a tweet with a url: http://t.co/0DlGChTBIx'
>>> m = re.sub(r':.*$', ":", str)
>>> m
'This is a tweet with a url:'

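It removes everything from the first : to the end of the string, and the : in the replacement string puts the colon back at the end. Note that this relies on the URL being the last thing in the tweet and on a : appearing right before it.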

This returns all the characters up to and including the first : symbol:

>>> m = re.search(r'^.*?:', str).group()
>>> m
'This is a tweet with a url:'
Avinash Raj
1
text = re.sub(r"https:(\/\/t\.co\/([A-Za-z0-9]|[A-Za-z]){10})", "", text)

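This matches exactly ten alphanumeric characters after https://t.co/. Note that it requires https, so the http:// link in the question is not matched.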
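
A possible variant, sketched under the assumption that t.co links may use either http or https and that the code after the slash is alphanumeric but not necessarily ten characters long. The name `TCO_PATTERN` and the second URL in the sample are made up for illustration.

```python
import re

# Assumption: scheme is http or https, followed by one or more alphanumerics.
TCO_PATTERN = re.compile(r"https?://t\.co/[A-Za-z0-9]+")

tweet = "This is a tweet with a url: http://t.co/0DlGChTBIx and https://example.com"
cleaned = TCO_PATTERN.sub("", tweet)
print(cleaned)  # This is a tweet with a url:  and https://example.com
```

Only the t.co link is removed; other URLs are left in place, which may or may not be what you want.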

self.Fool
0

Try using this:

text = re.sub(r"http\S+", "", text)
alexander.polomodov
Garima Rawat
0

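import re

# Repeatedly cut out the first URL until no match is left.
# Note: \s requires whitespace after the URL, so a URL at the very end of content is not removed.
clean_tweet = re.match(r'(.*?)http(.*?)\s(.*)', content)

while clean_tweet:
    content = clean_tweet.group(1) + " " + clean_tweet.group(3)
    clean_tweet = re.match(r'(.*?)http(.*?)\s(.*)', content)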

0

I found this solution:

text = re.sub(r'https?://\S+|www\.\S+', '', text)
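
A short sketch of how this pattern behaves on links with and without a scheme. The helper name `remove_urls` and the sample text are hypothetical, and collapsing the leftover whitespace is an added assumption rather than part of the one-liner above.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def remove_urls(text):
    # Hypothetical helper: drop URLs, then collapse the whitespace runs left behind.
    without_urls = URL_PATTERN.sub("", text)
    return " ".join(without_urls.split())

print(remove_urls("docs at www.example.com and https://example.org/a?b=1 today"))
# docs at and today
```

The `www\.\S+` alternative catches links written without http:// or https://, which the other patterns in this thread would miss.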