I think it would be better to use Beautiful Soup instead.
The text to parse is an iframe tag with a src attribute. You are trying to retrieve the URL after href= and before &width in the src attribute.
After that, you need to percent-decode the URL back into plain text.
First, throw the markup into Beautiful Soup and pull the src attribute out of it:
text = '<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646&width=500" width="500" height="482" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowTransparency="true"></iframe>'
# Name the parser explicitly; otherwise bs4 guesses one and emits a warning
soup = BeautifulSoup(text, "html.parser")
src_attribute = soup.find("iframe")["src"]
Then you could either use a regex or .split() (quite hacky):
# Regex (non-greedy, so the match stops at the first & after href=)
link = re.search(r'href=(.*?)&', src_attribute).group(1)
# .split()
link = src_attribute.split("href=")[1].split("&")[0]
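One thing to watch with the regex approach is greediness. Here is a small sketch of the difference, using a made-up src with two parameters after href (the example.com URL is hypothetical, not from the question):

```python
import re

# Hypothetical src with two parameters after href (for illustration only)
src = "https://example.com/plugin?href=https%3A%2F%2Fexample.com%2Fpost&width=500&height=300"

# Greedy: the group runs up to the LAST & in the string
greedy = re.search(r"href=(.*)&", src).group(1)

# Non-greedy: the group stops at the FIRST & after href=
lazy = re.search(r"href=(.*?)&", src).group(1)

# Negated character class: also works when href is the last parameter
no_amp = re.search(r"href=([^&]*)", src).group(1)

print(greedy)  # https%3A%2F%2Fexample.com%2Fpost&width=500
print(lazy)    # https%3A%2F%2Fexample.com%2Fpost
print(no_amp)  # https%3A%2F%2Fexample.com%2Fpost
```

With the iframe from the question there is only one & in the src, so all three patterns return the same thing, but the greedy one would break as soon as another parameter is appended.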
Lastly, you need to decode the URL with unquote (urllib2.unquote on Python 2; on Python 3 the same function lives in urllib.parse):
link = urllib2.unquote(link)
and you are done!
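Note that urllib2 only exists on Python 2. If you are on Python 3, the decoding step would look roughly like this (a minimal sketch; the encoded string is just the href value from the iframe above, hard-coded so the snippet runs on its own):

```python
from urllib.parse import unquote

# Percent-encoded value of the href= parameter from the iframe above
encoded = "https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646"

# unquote turns %3A back into ":" and %2F back into "/"
link = unquote(encoded)
print(link)  # https://www.facebook.com/DoctorTaniya/posts/1906676949620646
```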
So the resulting code would be:
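from bs4 import BeautifulSoup
import urllib2  # Python 2 only; on Python 3 use urllib.parse instead
import re

text = '<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646&width=500" width="500" height="482" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowTransparency="true"></iframe>'
# Name the parser explicitly; otherwise bs4 guesses one and emits a warning
soup = BeautifulSoup(text, "html.parser")
src_attribute = soup.find("iframe")["src"]

# Option 1: regex (non-greedy, so the match stops at the first & after href=)
link = re.findall(r'href=(.*?)&', src_attribute)[0]
# Option 2: .split() (same result; use one or the other)
link = src_attribute.split("href=")[1].split("&")[0]

# Percent-decode the URL (urllib.parse.unquote on Python 3)
link = urllib2.unquote(link)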
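If both the regex and .split() feel too hacky, another option is to let the standard library parse the query string for you. This is a different technique from the one above, sketched for Python 3 with urllib.parse; src_attribute is hard-coded here to the value Beautiful Soup returns for the iframe, so the snippet runs on its own:

```python
from urllib.parse import urlsplit, parse_qs

# The src value extracted from the iframe (hard-coded for this sketch)
src_attribute = (
    "https://www.facebook.com/plugins/post.php"
    "?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646"
    "&width=500"
)

# Split off the query string, then parse it into a dict of lists
query = urlsplit(src_attribute).query
params = parse_qs(query)

# parse_qs already percent-decodes the values, so no unquote call is needed
link = params["href"][0]
print(link)  # https://www.facebook.com/DoctorTaniya/posts/1906676949620646
```

This does not depend on href being followed by &width, so it keeps working if the parameter order changes.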