1

I'm grabbing a series of links from a website with Python and BS4, but I need to clean them up so that I only get the URL from each string.

The links I get look like this:

javascript:changeChannel('http://some-server.com/with1234init.also', 20);

and I need them to look like this:

http://some-server.com/with1234init.also

user3332151
  • What is your attempt? – spiehr Feb 20 '14 at 10:31
  • Are all strings of the exact same format, or are there corner cases in the HTML that may cause simple rules to fail? – jozxyqk Feb 20 '14 at 11:48
  • I forgot to mention that all the links I grab are different. They all start with the javascript:changeChannel(' part, but the URLs are different, and the end after the last ' is also different in all of the links. – user3332151 Feb 20 '14 at 13:45

3 Answers

1

Well, if all the links are like that one, you can use a very simple approach:

s.split("'")[1]

For example:

>>> s = "javascript:changeChannel('http://some-server.com/with1234init.also', 20);"
>>> s.split("'")
['javascript:changeChannel(',
 'http://some-server.com/with1234init.also',
 ', 20);']
>>> s.split("'")[1]
'http://some-server.com/with1234init.also'
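
A minimal sketch of how the split approach might be applied to a whole batch of scraped links. The helper name `extract_url` and the sample `hrefs` list are assumptions made for illustration, not part of the original answer. The guard returns None when a string does not contain a pair of single quotes.

```python
def extract_url(href):
    """Return the text between the first pair of single quotes, or None."""
    parts = href.split("'")
    # A well-formed link splits into at least three parts:
    # prefix, URL, suffix.
    if len(parts) < 3:
        return None
    return parts[1]


# Hypothetical sample data in the format described in the question.
hrefs = [
    "javascript:changeChannel('http://some-server.com/with1234init.also', 20);",
    "javascript:changeChannel('http://other-server.com/stream', 7);",
    "javascript:void(0);",
]

urls = [extract_url(h) for h in hrefs]
print(urls)
```

Returning None rather than raising an IndexError keeps one malformed link from aborting the whole batch. Whether that is desirable depends on how the results are used.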
Paulo Bu
  • True, and I was about to post this, however, it does not give you something exact. Perhaps, you can do this and _then_ do a search with a regex to determine the index value. – Games Brainiac Feb 20 '14 at 10:34
  • Well, if all the strings are formatted the same this will probably work well for everyone. What is the case you say is not exact? – Paulo Bu Feb 20 '14 at 10:35
  • For example, there could be more than just 2 single quotes in the line. In essence, this solution will only work for this problem but does not solve the issue at large. – Games Brainiac Feb 20 '14 at 10:45
  • @GamesBrainiac you're right. The solution is very domain specific. I explained in the answer that all strings _needed_ to be in the same format. But if they are, I think it is worth doing because it is very simple. – Paulo Bu Feb 20 '14 at 10:49
  • Indeed, but I was hoping you knew some way to capture a URL (heh) using regex. I've been trying to make one myself, but I fail most of the time. – Games Brainiac Feb 20 '14 at 10:51
  • URL matching regex is huge. I thought about it but due to the simplicity of the environment I suggested this kind of _naive_ approach :) – Paulo Bu Feb 20 '14 at 11:20
  • @GamesBrainiac does my answer with a regex not work? All I got for the 1 min google was a single downvote :P – jozxyqk Feb 20 '14 at 11:25
  • @jozxyqk I'm afraid it does not, did you just copy it from gskinner's regexr? – Games Brainiac Feb 20 '14 at 11:27
  • @GamesBrainiac Works fine for me (Python 2.7.5). Copied near-verbatim from [here](http://daringfireball.net/2010/07/improved_regex_for_matching_urls). I had to escape the `'`. – jozxyqk Feb 20 '14 at 11:33
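
One way to address the concern about extra single quotes without a full URL regex is to anchor the match on the known `changeChannel('` prefix. This is only a sketch, assuming (as the asker states) that every link contains that prefix. The names `CHANNEL_RE` and `url_from_link` and the sample link are made up for the example.

```python
import re

# Match the first single-quoted argument of changeChannel(...).
CHANNEL_RE = re.compile(r"changeChannel\('([^']*)'")


def url_from_link(link):
    """Return the first quoted argument of changeChannel, or None."""
    match = CHANNEL_RE.search(link)
    return match.group(1) if match else None


# Hypothetical link with an extra quoted string before the URL,
# which would trip up a plain split("'")[1].
link = "javascript:log('x'); changeChannel('http://some-server.com/with1234init.also', 20);"
print(link.split("'")[1])
print(url_from_link(link))
```

This does not validate that the captured text is a URL. It only relies on the position of the argument.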
0
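# The variable is named s rather than str, which would shadow the built-in type.
s = "javascript:changeChannel('http://some-server.com/with1234init.also', 20);"
# Take everything after "http://", cut at the first comma, drop the closing quote.
formatted_text = "http://" + s.split("http://")[1].split(",")[0].strip("'")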
MONTYHS
0

A reasonably robust way is to take your chunk of text and search it with a URL-matching regex pattern.

See also:

Using regex...

import re
re.search(pattern, text)
... or
re.findall(pattern, text)
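
As a rough illustration of the difference between the two calls: `re.search` returns a match object for the first hit (or None), while `re.findall` returns a list of all hits. The deliberately simple pattern and the sample text below are assumptions for the example and are far less thorough than a real URL regex.

```python
import re

# Simplified stand-in pattern: a scheme followed by anything that is
# not whitespace or a quote character.
pattern = r"https?://[^\s'\"]+"
text = "changeChannel('http://a.example/one', 1); changeChannel('https://b.example/two', 2);"

first = re.search(pattern, text)
print(first.group())               # the first match only
print(re.findall(pattern, text))   # every match, as a list of strings
```

Note that when a pattern contains capturing groups, `re.findall` returns the groups instead of the whole match, so `re.search` or `re.finditer` may be more convenient with a heavily grouped pattern.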

A full example...

>>> p = re.compile(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))')
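or, with every backslash escaped instead of using the r prefix (the \x.. sequences are the UTF-8 bytes of the guillemets and curly quotes, as in a Python 2 byte string):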
>>> p = re.compile('(?i)\\b((?:https?://|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}/)(?:[^\\s()<>]+|\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\))+(?:\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\)|[^\\s`!()\\[\\]{};:\\\'".,<>?\xc2\xab\xc2\xbb\xe2\x80\x9c\xe2\x80\x9d\xe2\x80\x98\xe2\x80\x99]))')

>>> m = p.search("javascript:changeChannel('http://some-server.com/with1234init.also', 20);")
>>> m.group()
'http://some-server.com/with1234init.also'
  1. The pattern used is the web URL version from the above link.

    Note the use of the r prefix and the escaped ' quote towards the end in the first pattern.

  2. Using re.compile compiles the pattern once into a pattern object that can be reused for later searches.
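
To illustrate the second note: a compiled pattern object can be reused across many strings through its own `search` method. The short pattern and the sample links below are placeholders assumed for this sketch. The long pattern above could be substituted.

```python
import re

# Compile once; the resulting object is reused for every link.
url_re = re.compile(r"https?://[^\s'\"]+")

# Hypothetical links in the format described in the question.
links = [
    "javascript:changeChannel('http://some-server.com/with1234init.also', 20);",
    "javascript:changeChannel('https://some-server.com/other', 3);",
]

urls = []
for link in links:
    m = url_re.search(link)
    if m:  # search returns None when nothing matches
        urls.append(m.group())

print(urls)
```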

Community
jozxyqk