How to extract only the URL from the following strings using regular expressions?

Question

I am extracting URLs from a set of raw data, and I intend to do this using Python regular expressions.

I tried

(http.+)

But it matched everything from http to the end of the line.

Input

href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone

https://vine.co/v/i6iIrBwnTFI

Expected Output

http://twitter.com/download/iphone

https://vine.co/v/i6iIrBwnTFI

HakunaMaData · Accepted Answer · 2018-12-29T06:18:39.270


Try this: http[^"\s]*

This assumes every link starts with http. The match stops as soon as it reaches whitespace or a double quote (").

Here is how you could use it:

import re
# Raw string so that \s reaches the regex engine unchanged
regexp = r'''http[^"\s]*'''
urls = '''href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone https://vine.co/v/i6iIrBwnTFI'''
output = re.findall(regexp, urls)
print(output)

['http://twitter.com/download/iphone', 'https://vine.co/v/i6iIrBwnTFI']
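To see why the original attempt (http.+) over-matches, a small sketch comparing the two patterns on the first sample line may help. The variable names here are illustrative only, and the negated class is written with a single leading ^.

```python
import re

# First sample line from the question.
line = 'href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone'

# Greedy ".+" keeps consuming characters until the end of the line.
greedy = re.findall(r'(http.+)', line)

# The negated class stops at the first double quote or whitespace character.
bounded = re.findall(r'http[^"\s]*', line)

print(greedy)
print(bounded)
```

The first list contains the URL plus the trailing attribute text, while the second contains only the URL.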

edited Dec 29 '18 at 06:18

answered Dec 29 '18 at 04:05


Do you mean (http[^\"^\s]*) – Command Dec 29 '18 at 04:25
Added some code that explains how you can use the regex – HakunaMaData Dec 29 '18 at 06:19

Bob Fred · Answer 2 · 2018-12-29T04:25:42.470

First, you should find out which characters are valid in a URL.

Then, the regular expression could be:

(http://|https://)([a-zA-Z0-9\-\._~:/\?\#\[\]@!$&'\(\)\*\+,;=]+)

In my Python interpreter, it looks like:

>>> import re
>>> regexp = r'''(http://|https://)([a-zA-Z0-9\-\._~:/\?\#\[\]@!$&'\(\)\*\+,;=]+)'''
>>> url = '''href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone https://vine.co/v/i6iIrBwnTFI'''
>>> r = re.findall(regexp, url)
>>> r
[('http://', 'twitter.com/download/iphone'), ('https://', 'vine.co/v/i6iIrBwnTFI')]
>>> [x[0]+x[1] for x in r]
['http://twitter.com/download/iphone', 'https://vine.co/v/i6iIrBwnTFI']
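As a possible alternative to joining the tuples afterwards, a non-capturing group (?:...) lets re.findall return each whole match directly. This is only a sketch using the same set of allowed characters, with redundant escapes dropped inside the character class. The names pattern, text and urls are illustrative, and it is not a full URL validator.

```python
import re

# (?:...) groups the scheme alternatives without capturing them,
# so findall returns plain strings instead of tuples.
pattern = re.compile(r"(?:http://|https://)[a-zA-Z0-9\-._~:/?#\[\]@!$&'()*+,;=]+")

text = ('href="http://twitter.com/download/iphone" rel="nofollow">'
        'Twitter for iPhone https://vine.co/v/i6iIrBwnTFI')

urls = pattern.findall(text)
print(urls)
```

With the sample input this yields the same two URLs as the list comprehension above.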
