I am extracting URLs from a set of raw data, and I want to do this with Python regular expressions.

I tried

(http.+)

But it matched everything from http to the end of the line.

Input

href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone

https://vine.co/v/i6iIrBwnTFI

Expected Output

http://twitter.com/download/iphone

https://vine.co/v/i6iIrBwnTFI

Command

2 Answers

Try this: http[^"\s]*

This assumes all your links start with http. The match stops at the first whitespace character or double quote (").

Here is how you could use it:

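import re
# Raw string so the backslash in \s reaches the regex engine unchanged.
# The match stops at the first double quote or whitespace character.
regexp = r'http[^"\s]*'
urls = '''href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone https://vine.co/v/i6iIrBwnTFI'''
output = re.findall(regexp, urls)
print(output)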

['http://twitter.com/download/iphone', 'https://vine.co/v/i6iIrBwnTFI']
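A slightly stricter variant, as a rough sketch: it requires the full http:// or https:// scheme, so a stray word that merely starts with "http" is not picked up, and it also stops at single quotes and angle brackets. The name extract_urls and the extra stop characters are my own assumptions, not something the question asks for.

```python
import re

# Assumption: a URL starts with http:// or https:// and ends at the first
# whitespace character, quote, or angle bracket.
URL_PATTERN = re.compile(r'https?://[^\s"\'<>]+')


def extract_urls(text):
    """Return every substring of text that matches URL_PATTERN, in order."""
    return URL_PATTERN.findall(text)


sample = ('href="http://twitter.com/download/iphone" rel="nofollow">'
          'Twitter for iPhone https://vine.co/v/i6iIrBwnTFI')
print(extract_urls(sample))
```

Compiling the pattern once is only a convenience if you call it on many lines of raw data.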

HakunaMaData

First, you should find out which characters are valid in a URL.

Then, the regular expression could be:

(http://|https://)([a-zA-Z0-9\-\._~:/\?\#\[\]@!$&'\(\)\*\+,;=]+)

In my Python interpreter, it looks like this:

>>> import re
>>> regexp = r'''(http://|https://)([a-zA-Z0-9\-\._~:/\?\#\[\]@!$&'\(\)\*\+,;=]+)'''
>>> url = '''href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone https://vine.co/v/i6iIrBwnTFI'''
>>> r = re.findall(regexp, url)
>>> r
[('http://', 'twitter.com/download/iphone'), ('https://', 'vine.co/v/i6iIrBwnTFI')]
>>> [x[0]+x[1] for x in r]
['http://twitter.com/download/iphone', 'https://vine.co/v/i6iIrBwnTFI']
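If you would rather skip the step that joins the tuples, one option (a sketch, not the only way) is to make the scheme group non-capturing and drop the second group, so re.findall returns whole matches. The names pattern, url and found are just for illustration. Note that the character class, like the one above, does not include %, so percent-encoded URLs would be cut short.

```python
import re

# (?:...) groups without capturing, so findall returns the full match
# as a plain string instead of a tuple of groups.
pattern = re.compile(
    r"(?:http://|https://)[a-zA-Z0-9\-._~:/?#\[\]@!$&'()*+,;=]+"
)

url = ('href="http://twitter.com/download/iphone" rel="nofollow">'
       'Twitter for iPhone https://vine.co/v/i6iIrBwnTFI')
found = pattern.findall(url)
print(found)
```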
Bob Fred