Strip everything but URL from a string in python

Question

I'm grabbing a series of links from a website with Python and BS4, but I need to clean them up so that I only get the URL from each string.

The links I get look like this:

javascript:changeChannel('http://some-server.com/with1234init.also', 20);

and I need them to look like this:

http://some-server.com/with1234init.also

Are all strings of the exact same format, or are there corner cases in the HTML that may cause simple rules to fail? – jozxyqk, Feb 20 '14 at 11:48
I forgot to mention that all the links I grab are different. They all start with the javascript:changeChannel(' part, but the URLs differ, and so does the part after the last ' in each link. – user3332151, Feb 20 '14 at 13:45

score 1 · Accepted Answer · answered Feb 20 '14 at 10:32

Well, if all the links are like that one you can do it with a very simple approach:

s.split("'")[1]

For example:

>>> s = "javascript:changeChannel('http://some-server.com/with1234init.also', 20);"
>>> s.split("'")
['javascript:changeChannel(',
 'http://some-server.com/with1234init.also',
 ', 20);']
>>> s.split("'")[1]
'http://some-server.com/with1234init.also'
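
As a minimal sketch of the same split-based idea, the lookup can be wrapped in a small function that returns None when a link has no quoted part, instead of raising an IndexError. The name extract_quoted is hypothetical and not part of the original answer, and the sketch still assumes the URL is the first single-quoted value in the string.

```python
def extract_quoted(link):
    """Return the text between the first pair of single quotes, or None."""
    parts = link.split("'")
    # A link with at least one quoted value splits into three or more parts:
    # the prefix, the quoted value, and whatever follows it.
    if len(parts) < 3:
        return None
    return parts[1]


link = "javascript:changeChannel('http://some-server.com/with1234init.also', 20);"
print(extract_quoted(link))
```

If the links can contain other quoted values before the URL, this sketch would return the wrong piece, which is the limitation raised in the comments.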

answered Feb 20 '14 at 10:32

Paulo Bu


True, and I was about to post this, however, it does not give you something exact. Perhaps, you can do this and _then_ do a search with a regex to determine the index value. – Games Brainiac Feb 20 '14 at 10:34
Well, if all the strings are formatted the same this will probably work well for everyone. What is the case you say is not exact? – Paulo Bu Feb 20 '14 at 10:35
For example, there could be more than just 2 single quotes in the line. In essence, this solution will only work for this problem but does not solve the issue at large. – Games Brainiac Feb 20 '14 at 10:45
@GamesBrainiac you're right. The solution is very domain specific. I explained in the answer that all strings _needed_ to have the same format. But if they do, I think it is worth doing because it is very simple. – Paulo Bu Feb 20 '14 at 10:49
Indeed, but I was hoping you knew some way to capture a URL (heh) using regex. I've been trying to make one myself, but I fail most of the time. – Games Brainiac Feb 20 '14 at 10:51
URL matching regex is huge. I thought about it but due to the simplicity of the environment I suggested this kind of _naive_ approach :) – Paulo Bu Feb 20 '14 at 11:20
@GamesBrainiac does my answer with a regex not work? All I got for the 1 min google was a single downvote :P – jozxyqk Feb 20 '14 at 11:25
@jozxyqk I'm afraid it does not, did you just copy it from gskinner's regexr? – Games Brainiac Feb 20 '14 at 11:27
@GamesBrainiac Works fine for me (python 2.7.5). Copied near-verbatim from [here](http://daringfireball.net/2010/07/improved_regex_for_matching_urls). I had to escape the `'`. – jozxyqk Feb 20 '14 at 11:33

score 0 · Answer 2 · answered Feb 20 '14 at 10:32

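 # The link must be a quoted string; "s" avoids shadowing the built-in str.
 s = "javascript:changeChannel('http://some-server.com/with1234init.also', 20);"
 # Take what follows "http://", cut at the first comma, drop the closing quote.
 formatted_text = "http://" + s.split("http://")[1].split(",")[0].strip("'")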

answered Feb 20 '14 at 10:32

MONTYHS


score 0 · Answer 3 · edited May 23 '17 at 12:28

A reasonably robust way is to take your chunk of text and search it with a URL-matching regex pattern.

See also:

Python regular expression again - match url
which links to here: http://daringfireball.net/2010/07/improved_regex_for_matching_urls
Extracting URL link using regular expression re - string matching - Python

Using regex...

import re
re.search(pattern, text)
... or
re.findall(pattern, text)
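
To illustrate the difference between the two calls: re.search returns a match object for the first hit (or None), while re.findall returns a list of every matching string. The sketch below uses a deliberately simple stand-in pattern and made-up example URLs, both of which are assumptions for illustration only and not the full URL pattern from the linked article.

```python
import re

# Simplified stand-in pattern (an assumption): "http://" or "https://"
# followed by anything that is not whitespace or a single quote.
simple_pattern = r"https?://[^\s']+"

# Made-up input containing two links.
text = "changeChannel('http://a.example/x', 1); changeChannel('http://b.example/y', 2);"

first = re.search(simple_pattern, text)
print(first.group() if first else None)   # first match only

all_urls = re.findall(simple_pattern, text)
print(all_urls)                           # every match, as a list of strings
```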

A full example...

>>> p = re.compile(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))')
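or, without the r prefix (every backslash doubled, non-ASCII quote characters written as Python 2 byte escapes):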
>>> p = re.compile('(?i)\\b((?:https?://|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}/)(?:[^\\s()<>]+|\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\))+(?:\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\)|[^\\s`!()\\[\\]{};:\\\'".,<>?\xc2\xab\xc2\xbb\xe2\x80\x9c\xe2\x80\x9d\xe2\x80\x98\xe2\x80\x99]))')

>>> m = p.search("javascript:changeChannel('http://some-server.com/with1234init.also', 20);")
>>> m.group()
'http://some-server.com/with1234init.also'

The pattern used is the web URL version from the link above.

Note the use of the r prefix and the escaped ' quote towards the end in the first pattern.
Using re.compile builds the pattern object once, so it can be reused for every search.
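
Since the asker says every link starts with javascript:changeChannel(' and the URL ends at the next ', a much narrower pattern may be enough for this particular site. The following is only a sketch under that assumption; the names CHANNEL_LINK and channel_url are hypothetical, and it does not validate that the captured text is a well-formed URL.

```python
import re

# Hypothetical pattern: capture whatever sits between the single quotes
# that directly follow "javascript:changeChannel(".
CHANNEL_LINK = re.compile(r"javascript:changeChannel\('([^']*)'")


def channel_url(href):
    """Return the quoted first argument of changeChannel(...), or None."""
    match = CHANNEL_LINK.search(href)
    return match.group(1) if match else None


href = "javascript:changeChannel('http://some-server.com/with1234init.also', 20);"
print(channel_url(href))
```

Compared with the general URL pattern, this trades generality for readability: it only works for links of this exact shape, but extra quotes later in the string do not affect it.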
