1

I have a string like this:

<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646&width=500" width="500" height="482" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowTransparency="true"></iframe>

I want to extract the link:

www.facebook.com/DoctorTaniya/posts/1906676949620646

How can I write a Python script to do this?

kello

4 Answers

2

I think it would be better to use Beautiful Soup instead.

The text to parse is an iframe tag with a src attribute. You are trying to retrieve the URL after href= and before &width in that attribute.

After that, you need to percent-decode the URL back to plain text.

First, pass the text to Beautiful Soup and get the attribute out of it:

text = '<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646&width=500" width="500" height="482" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowTransparency="true"></iframe>'
soup = BeautifulSoup(text, "html.parser")  # name the parser explicitly to avoid a warning

src_attribute = soup.find("iframe")["src"]

Then you could use a regex or .split() (quite hacky):

# Regex
link = re.search(r'href=([^&]+)', src_attribute).group(1)

# .split()
link = src_attribute.split("href=")[1].split("&")[0]

Lastly, you need to percent-decode the URL with unquote (urllib.parse.unquote in Python 3, urllib2.unquote in Python 2):

link = urllib.parse.unquote(link)

and you are done!

So the resulting code would be:

from bs4 import BeautifulSoup
import urllib.parse  # Python 2: import urllib2 and call urllib2.unquote
import re

text = '<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646&width=500" width="500" height="482" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowTransparency="true"></iframe>'
soup = BeautifulSoup(text, "html.parser")

src_attribute = soup.find("iframe")["src"]

# Regex: capture everything after href= up to the next &
link = re.search(r'href=([^&]+)', src_attribute).group(1)
# .split() (gives the same result; use one or the other)
link = src_attribute.split("href=")[1].split("&")[0]

# Percent-decode, e.g. %3A -> ":" and %2F -> "/"
link = urllib.parse.unquote(link)
Moon Cheesez
0

Here is some useful information about using regex to find URLs in Python.

If all the URLs your code will work with start right after .php?href=, you can scan for ?href= and split the string there.

Or, if PHP is an option, you can read the parameter from $_GET[] and print it; here is another post you might want to read.
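
In Python, a rough counterpart of reading $_GET is to parse the query string of the src URL with the standard library. A minimal sketch, assuming the src value has already been pulled out of the iframe (the variable names are only illustrative):

```python
from urllib.parse import urlsplit, parse_qs

# src attribute of the iframe from the question
src = ("https://www.facebook.com/plugins/post.php"
       "?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646"
       "&width=500")

# parse_qs splits the query string and percent-decodes each value
params = parse_qs(urlsplit(src).query)
post_url = params["href"][0]

# Drop the scheme to get the form asked for in the question
parts = urlsplit(post_url)
link = parts.netloc + parts.path
print(link)
```

This avoids hand-written string splitting, at the cost of raising a KeyError if the URL has no href parameter.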

Community
Raven H.
0
import re

string = '<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646&width=500" width="500" height="482" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowTransparency="true"></iframe>'

m = re.search( r'href=https%3A%2F%2F(.*)&width', string)
str2 = m.group(1)
str2.replace('%2F', '/')

Output

>>> str2.replace('%2F', '/')
'www.facebook.com/DoctorTaniya/posts/1906676949620646'
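
Replacing only %2F works for this particular URL, but any other percent-escape (for example %3F or %20) would be left untouched. A hedged variant that decodes every escape with urllib.parse.unquote and then strips the scheme; the iframe string is shortened here and the helper name strip_scheme is made up for illustration:

```python
import re
from urllib.parse import unquote

string = '<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646&width=500" width="500"></iframe>'

def strip_scheme(url):
    # Hypothetical helper: remove a leading "http://" or "https://"
    return re.sub(r'^https?://', '', url)

# Capture everything after href= up to the next & or closing quote
m = re.search(r'href=([^&"]+)', string)
if m:
    result = strip_scheme(unquote(m.group(1)))
    print(result)
```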
Abhishek Menon
0

Use a combination of BeautifulSoup, re and urllib:

from bs4 import BeautifulSoup
import re
from urllib.parse import unquote  # Python 2: from urllib import unquote

html = """
<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646&width=500" width="500" height="482" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowTransparency="true"></iframe>
<p>some other rubbish here</p>
"""

# da soup (html5lib is a separate install; "html.parser" also works)
soup = BeautifulSoup(html, 'html5lib')

# href, (anything not &) afterwards
rx = re.compile(r'href=([^&]+)')

for iframe in soup.find_all('iframe'):
    link = unquote(rx.search(iframe['src']).group(1))
    print(link)
    # https://www.facebook.com/DoctorTaniya/posts/1906676949620646

It parses the DOM, looks for iframes, applies a regular expression to each src attribute and unquotes the URL it finds. This way the regular expression never runs on the raw HTML itself.
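
If installing BeautifulSoup is not an option, the same parse-then-extract flow can be sketched with the standard library's html.parser. This is an illustration, not part of the answer above: the class IframeSrcCollector, the function embedded_links and the second iframe in the sample are invented for the example, and iframes without a src or an href parameter are simply skipped.

```python
import re
from html.parser import HTMLParser
from urllib.parse import unquote

class IframeSrcCollector(HTMLParser):
    """Collects the src attribute of every <iframe> start tag."""
    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)

# href=, then anything that is not &
rx = re.compile(r'href=([^&]+)')

def embedded_links(html):
    collector = IframeSrcCollector()
    collector.feed(html)
    links = []
    for src in collector.sources:
        match = rx.search(src)
        if match:  # skip iframes whose src has no href parameter
            links.append(unquote(match.group(1)))
    return links

html = """
<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646&width=500"></iframe>
<iframe src="https://example.com/no-href-here"></iframe>
<p>some other rubbish here</p>
"""
print(embedded_links(html))
```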

Jan