How to extract a link from an embedded link with Python?

Question

I have a string like this:

<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646&width=500" width="500" height="482" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowTransparency="true"></iframe>

I want to extract the link:

www.facebook.com/DoctorTaniya/posts/1906676949620646

How can I write a Python script to do this?

@hallazzang not necessarily since it is in HTML, normal HTML parsers can work too. — Moon Cheesez, Apr 11 '17 at 05:34
@hallazzang I figured that out but could not write the regex needed. — kello, Apr 11 '17 at 05:36
@MoonCheesez yes, but even with HTML parsers, a regular expression is still a good choice for extracting the link from `iframe[src]`. — hallazzang, Apr 11 '17 at 09:37

score 2 · Accepted Answer · answered Apr 11 '17 at 07:06

I think it would be better to use Beautiful Soup instead.

The text to parse is an iframe tag with a src attribute. You are trying to retrieve the URL that sits after href= and before &width in that attribute.

After that, you need to percent-decode the URL back into plain text.

First, pass the text to Beautiful Soup and pull the attribute out:

text = '<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646&width=500" width="500" height="482" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowTransparency="true"></iframe>'
# Name the parser explicitly; otherwise bs4 guesses one and emits a warning
soup = BeautifulSoup(text, "html.parser")

src_attribute = soup.find("iframe")["src"]

From there you could use either a regex or .split() (quite hacky):

# Regex: capture everything after href= up to the next &
link = re.search(r'href=([^&]+)', src_attribute).group(1)

# .split()
link = src_attribute.split("href=")[1].split("&")[0]

Lastly, you need to decode the URL with unquote, which lives in urllib.parse on Python 3 (on Python 2, urllib2.unquote does the same job):

link = urllib.parse.unquote(link)

and you are done!

So the resulting code would be:

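from bs4 import BeautifulSoup
import urllib.parse  # Python 3; on Python 2 use urllib2.unquote instead
import re

text = '<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646&width=500" width="500" height="482" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowTransparency="true"></iframe>'
# Name the parser explicitly; otherwise bs4 guesses one and emits a warning
soup = BeautifulSoup(text, "html.parser")

src_attribute = soup.find("iframe")["src"]

# Option 1: regex (capture everything after href= up to the next &)
link = re.search(r'href=([^&]+)', src_attribute).group(1)
# Option 2: .split() (same result; use one option or the other)
link = src_attribute.split("href=")[1].split("&")[0]

# Decode the percent-escapes (%3A -> ':', %2F -> '/')
link = urllib.parse.unquote(link)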
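
As an alternative to the regex and .split() options above, the query string can be handed to the standard library's URL parser, which also does the decoding. This is only a sketch under the assumption of Python 3; the helper name href_param is made up for illustration, and src_attribute is written out by hand here so the snippet runs without Beautiful Soup.

```python
from urllib.parse import urlsplit, parse_qs

# The iframe's src value, as returned by soup.find("iframe")["src"] above
src_attribute = (
    "https://www.facebook.com/plugins/post.php"
    "?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646"
    "&width=500"
)


def href_param(src):
    """Return the decoded `href` query parameter of `src`, or None if missing."""
    query = urlsplit(src).query          # 'href=https%3A...&width=500'
    values = parse_qs(query).get("href")  # parse_qs percent-decodes the values
    return values[0] if values else None


link = href_param(src_attribute)
print(link)
# https://www.facebook.com/DoctorTaniya/posts/1906676949620646
```

Because parse_qs splits on & and decodes each value, this does not depend on href being the first parameter or on width following it.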

score 0 · Answer 2 · edited May 23 '17 at 11:54

Here is some useful information about using regex to find URLs in Python.

If all the URLs your code will work with start right after .php?href=, you can scan the string until ?href= is found and split it there.

Or, if you are working in PHP, you can read the parameter with $_GET[] and print it; here is another post you might want to read.
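
The "find ?href= and split" idea could look roughly like this in Python 3. It is a sketch, not code from the answer: the function name link_after_marker is hypothetical, and it assumes the value ends at the next & or at the closing double quote of the src attribute.

```python
from urllib.parse import unquote


def link_after_marker(text, marker="?href="):
    """Return the decoded value that follows `marker`, or None if it is absent."""
    _, found, rest = text.partition(marker)
    if not found:
        return None
    # Assumed terminators: the next query parameter (&) or the closing quote (")
    for terminator in ("&", '"'):
        rest = rest.split(terminator, 1)[0]
    return unquote(rest)


iframe = (
    '<iframe src="https://www.facebook.com/plugins/post.php'
    '?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646'
    '&width=500" width="500" height="482"></iframe>'
)
print(link_after_marker(iframe))
# https://www.facebook.com/DoctorTaniya/posts/1906676949620646
```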

answered Apr 11 '17 at 05:36 by Raven H. · edited May 23 '17 at 11:54 by Community

score 0 · Answer 3 · answered Apr 11 '17 at 06:44

import re

string = '<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646&width=500" width="500" height="482" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowTransparency="true"></iframe>'

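# Capture everything between the encoded "https://" and "&width"
m = re.search(r'href=https%3A%2F%2F(.*)&width', string)
str2 = m.group(1)
# str.replace returns a new string, so keep the result
link = str2.replace('%2F', '/')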

Output

>>> str2.replace('%2F', '/')
'www.facebook.com/DoctorTaniya/posts/1906676949620646'
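
Replacing only %2F works for this particular URL, but any other percent-escape in the path would be left encoded. A possible variant, assuming Python 3, decodes every escape with urllib.parse.unquote and then drops the scheme to get the exact string asked for in the question. The iframe string is shortened here for readability.

```python
import re
from urllib.parse import unquote

string = (
    '<iframe src="https://www.facebook.com/plugins/post.php'
    '?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646'
    '&width=500" width="500" height="482"></iframe>'
)

# Capture the encoded value: everything after href= up to the next & or "
m = re.search(r'href=([^&"]+)', string)
decoded = unquote(m.group(1))  # decodes %3A, %2F and any other escapes

# Drop the scheme ("https://") to match the form requested in the question
link = decoded.split("://", 1)[-1]
print(link)
# www.facebook.com/DoctorTaniya/posts/1906676949620646
```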

score 0 · Answer 4 · answered Apr 11 '17 at 07:08

Use a combination of BeautifulSoup, re and urllib:

from bs4 import BeautifulSoup
import re
from urllib.parse import unquote  # Python 3; on Python 2 this was urllib.unquote

html = """
<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646&width=500" width="500" height="482" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowTransparency="true"></iframe>
<p>some other rubbish here</p>
"""

# Parse the HTML (the html5lib parser must be installed separately)
soup = BeautifulSoup(html, 'html5lib')

# href=, then anything that is not &
rx = re.compile(r'href=([^&]+)')

for iframe in soup.find_all('iframe'):
    link = unquote(rx.search(iframe['src']).group(1))
    print(link)
    # https://www.facebook.com/DoctorTaniya/posts/1906676949620646

It parses the HTML, looks for iframes, applies a regular expression to each src attribute, and unquotes the URL it finds. The regex therefore only ever sees the attribute value, never the raw markup.
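
If installing Beautiful Soup and html5lib is not an option, the same parse-then-regex flow can be approximated with the standard library's html.parser. This is a sketch under that assumption, not the answer's own method; the class name IframeSrcCollector is made up for illustration.

```python
import re
from html.parser import HTMLParser
from urllib.parse import unquote


class IframeSrcCollector(HTMLParser):
    """Collect the src attribute of every <iframe> start tag that is fed in."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


html = """
<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646&width=500" width="500" height="482"></iframe>
<p>some other rubbish here</p>
"""

collector = IframeSrcCollector()
collector.feed(html)

# Same pattern as above: href=, then anything that is not &
rx = re.compile(r'href=([^&]+)')

links = []
for src in collector.sources:
    match = rx.search(src)
    if match:  # skip iframes whose src has no href parameter
        links.append(unquote(match.group(1)))

print(links)
# ['https://www.facebook.com/DoctorTaniya/posts/1906676949620646']
```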
