0

I want to extract a full URL from a string.

My code is:

import re
data = "ahahahttp://www.google.com/a.jpg>hhdhd"
print re.match(r'(ftp|http)://.*\.(jpg|png)$', data)

Output:

None

Expected Output

http://www.google.com/a.jpg

I found many similar questions on Stack Overflow, but none of the answers worked for me, so I don't believe this is a duplicate. Please help me! Thanks.

Will
shiv shankar
    This has been answered many times, e.g. http://stackoverflow.com/questions/6883049/regex-to-find-urls-in-string-in-python – apotry Feb 05 '16 at 07:55

3 Answers

4

You were close!

Try this instead:

r'(ftp|http)://.*\.(jpg|png)'

You can visualize this here.

I would also make this non-greedy like this:

r'(ftp|http)://.*?\.(jpg|png)'

You can visualize this greedy vs. non-greedy behavior here and here.

By default, .* will match as much text as possible, but you want to match as little text as possible.
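To make the greedy vs. non-greedy difference concrete, here is a minimal sketch. The input string `text` with two URLs on one line is a made-up example, not taken from the question:

```python
import re

# Hypothetical input containing two image URLs on one line.
text = "x http://a.com/1.jpg y http://b.com/2.png z"

# Greedy: .* runs to the end, then backtracks to the LAST extension.
greedy = re.search(r'(ftp|http)://.*\.(jpg|png)', text).group(0)

# Non-greedy: .*? stops at the FIRST extension it can reach.
lazy = re.search(r'(ftp|http)://.*?\.(jpg|png)', text).group(0)

print(greedy)  # http://a.com/1.jpg y http://b.com/2.png
print(lazy)    # http://a.com/1.jpg
```

With only one URL in the string, as in the question, both patterns return the same match; the difference only shows up when a second extension appears later in the text.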

Your $ anchors the match at the end of the string, but in your example the URL is followed by more text (>hhdhd), so the anchored pattern cannot match.

Another problem is that you're using re.match() instead of re.search(). re.match() only matches at the beginning of the string, whereas re.search() scans the whole string for the first match. See here for more information.
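As a rough illustration, all three points can be checked against the string from the question. The variable names here are just for this sketch:

```python
import re

data = "ahahahttp://www.google.com/a.jpg>hhdhd"
pattern = r'(ftp|http)://.*?\.(jpg|png)'

# re.match() only tries position 0, where the text is "ahaha...".
at_start = re.match(pattern, data)

# With a trailing $, the extension would have to end the string.
anchored = re.search(pattern + '$', data)

# re.search() without the anchor finds the URL inside the string.
found = re.search(pattern, data)

print(at_start)        # None
print(anchored)        # None
print(found.group(0))  # http://www.google.com/a.jpg
```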

Will
  • 1
    Upvote for the visualization. However, it is not valid in some sense; for example, http://http://.xjpg. – AlexWei Feb 05 '16 at 08:13
  • 1
    Thanks! I fixed the visualizations so the original regex correction is shown, as well as the greedy vs. non-greedy match. And yes, this isn't a great regex to match URLs in all forms, but that's answered elsewhere, and my goal is to show what the main problems with OP's regex are for the examples given :) – Will Feb 05 '16 at 08:17
  • 1
    Thank you, I got it now! – shiv shankar Feb 05 '16 at 08:21
  • No problem, glad to help! – Will Feb 05 '16 at 08:25
1

You should use search instead of match.

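import re
data = "ahahahttp://www.google.com/a.jpg>hhdhd"
# search() scans the whole string; match() would only try position 0
url = re.search(r'(ftp|http)://.*\.(jpg|png)', data)
if url:
    # group(0) is the entire matched text
    print(url.group(0))

0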

Find the start of the URL with find('http://') or find('ftp://'), find the end with find('.jpg') or find('.png'), then take the substring between them.

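data = "ahahahttp://www.google.com/a.jpg>hhdhd"
start = data.find('http://')  # index of the scheme, or -1 if absent
if start != -1:
    rest = data[start:]
    end = rest.find('.jpg')  # index of the extension within rest, or -1
    if end != -1:
        print(rest[:end + len('.jpg')])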
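The snippet above only handles http:// and .jpg. A possible sketch that covers both schemes and both extensions mentioned in this answer follows; the helper name `extract_url` and its parameters are hypothetical, chosen for illustration:

```python
def extract_url(data, schemes=('http://', 'ftp://'), exts=('.jpg', '.png')):
    """Return the first scheme-to-extension substring, or None."""
    # Earliest position at which any of the schemes occurs.
    starts = [data.find(s) for s in schemes if data.find(s) != -1]
    if not starts:
        return None
    rest = data[min(starts):]
    # Earliest end position across all extensions.
    ends = [rest.find(e) + len(e) for e in exts if rest.find(e) != -1]
    if not ends:
        return None
    return rest[:min(ends)]


print(extract_url("ahahahttp://www.google.com/a.jpg>hhdhd"))
# http://www.google.com/a.jpg
```

Like the regex versions, this does no real URL validation; it just slices between the first scheme and the nearest extension after it.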