-1

Help with Python, please. I am trying to scrape a webpage with Python. When I try to get the iframe src values from this URL, I get only one iframe source.

This is the webpage I tried to scrape.


Source 1



Source 2


Source 2

This is my Python code:

iframe = re.compile( '<iframe.*src="(.*?)"' ).findall( html )

This gives me only 1 iframe, but there are 4 iframes on the page.

Thank you

CyberHelp

3 Answers

1

It is highly recommended to not parse HTML with regular expressions. For Python, Beautiful Soup is a widely used option that does this parsing for you.

To extract your <iframe/> sources, you could use something like this:

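from bs4 import BeautifulSoup
import requests

url = "http://kathaltamil.com/?v=Kanithan"  # URL of the page in question
resp = requests.get(url)
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(resp.text, "html.parser")
# find_all is the current name for findAll; src=True skips iframes without a src
for frame in soup.find_all('iframe', src=True):
    print(frame['src'])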

For the URL that you specified, this yields the following result:

http://www.playhd.video/embed.php?vid=xxx
http://mersalaayitten.com/embed/xxx
http://www.playhd.video/embed.php?vid=xxx
http://googleplay.tv/videos/kanithan?iframe=true
//www.facebook.com/plugins/likebox.php?href=https%3A%2F%2Fwww.facebook.com%2Fkathaltamilmovie&width=600&height=188&colorscheme=light&show_faces=true&header=false&stream=false&show_border=true
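Note that the last entry is protocol-relative (it starts with //). If absolute URLs are needed, one option is urljoin from the standard library. This is only a sketch: page_url is assumed to be the page from the question, and the Facebook src is shortened here for readability.

```python
from urllib.parse import urljoin

# Assumed for illustration: the page URL and two of the src values listed above.
page_url = "http://kathaltamil.com/?v=Kanithan"
sources = [
    "http://mersalaayitten.com/embed/xxx",
    "//www.facebook.com/plugins/likebox.php",  # query string dropped for brevity
]

# urljoin leaves absolute URLs untouched and borrows the scheme from page_url
# for protocol-relative ones.
absolute = [urljoin(page_url, src) for src in sources]
print(absolute)
```

The same call also resolves plain relative paths, so it can be applied to every src without checking its form first.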
Suever
1

If you just want the four that are together, you can pull them from the second table, which holds the four iframes, using BeautifulSoup CSS selectors, in particular nth-of-type(2):

from bs4 import BeautifulSoup
import requests

html = requests.get("http://kathaltamil.com/?v=Kanithan").content
soup = BeautifulSoup(html, "html.parser")

urls = [ifr["src"] for ifr in soup.select("table:nth-of-type(2)")[0].select("iframe")]

Which will give you just the four:

['http://www.playhd.video/embed.php?vid=621',
 'http://mersalaayitten.com/embed/3752',
 'http://www.playhd.video/embed.php?vid=584',
 'http://googleplay.tv/videos/kanithan?iframe=true']

Or even easier with lxml and XPath:

import requests
from lxml.etree import fromstring, HTMLParser

html = requests.get("http://kathaltamil.com/?v=Kanithan").content
xml = fromstring(html, HTMLParser())

print(xml.xpath("//table[2]//iframe/@src"))

Which gives you the same:

['http://www.playhd.video/embed.php?vid=621',
 'http://mersalaayitten.com/embed/3752',
 'http://www.playhd.video/embed.php?vid=584',
 'http://googleplay.tv/videos/kanithan?iframe=true']

Whatever you choose is going to be a better option than your regex.
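If installing a third-party library is not an option, the standard library's html.parser can also collect the src values. The following is a minimal sketch, not a replacement for the approaches above: the class name IframeSrcCollector and the sample markup are made up for illustration, and it collects every iframe rather than only those in the second table.

```python
from html.parser import HTMLParser


class IframeSrcCollector(HTMLParser):
    """Collects the src attribute of every <iframe> start tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs.
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


# Made-up sample markup, not the real page.
sample = """
<table><tr><td><iframe src="http://example.com/a"></iframe></td></tr></table>
<table><tr><td><iframe SRC="http://example.com/b"></iframe>
<iframe width="10"></iframe></td></tr></table>
"""

collector = IframeSrcCollector()
collector.feed(sample)
print(collector.sources)
```

Iframes without a src attribute are skipped, which avoids the KeyError that indexing a tag with ["src"] would raise in that case.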

Padraic Cunningham
0

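It seems you forgot a question mark (?) after the first .*, which leaves it greedy: when several iframes sit on the same line, it swallows everything up to the last src, so only one match comes back. The non-greedy version is: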

iframe = re.compile( '<iframe.*?src="(.*?)"' ).findall( html )
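The difference between the two patterns can be seen on a small sample. This is only an illustration: the html string below is made up, with two iframes on a single line.

```python
import re

# Made-up sample: two iframes on one line, as minified pages often have.
html = '<iframe src="a.html"></iframe><iframe src="b.html"></iframe>'

# Greedy .* runs to the end of the line and backtracks to the LAST src.
greedy = re.findall(r'<iframe.*src="(.*?)"', html)
# Non-greedy .*? stops at the first src after each <iframe.
lazy = re.findall(r'<iframe.*?src="(.*?)"', html)

print(greedy)  # ['b.html']
print(lazy)    # ['a.html', 'b.html']
```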

Overall, though, keep in mind that regexes are not a good way to parse HTML pages. Beautiful Soup, lxml, Scrapy, and other libraries will be more efficient and powerful.

Bharel