
I'd like to gather a list of films and their links for all the movies available on the Sky Cinema website.

The website is:

http://www.sky.com/tv/channel/skycinema/find-a-movie#/search?genre=all&window=skyCinema&certificate=all&offset=0&scrollPosition=200

I am using Python 3.6 and Beautiful Soup.

I am having problems finding the title and the link, especially as there are several pages to click through, possibly driven by the scroll position (in the URL?).

I've tried using Beautiful Soup and Python, but there is no output. The code I have tried would only return the title anyway; I'd like the title and the link to the film, and as these sit in different areas of the site I am unsure how to combine them.

Code I have tried:

from bs4 import BeautifulSoup
import requests

link = "http://www.sky.com/tv/channel/skycinema/find-a-movie#/search?genre=all&window=skyCinema&certificate=all&offset=0&scrollPosition=200"
r = requests.get(link)
page = BeautifulSoup(r.content, "html.parser")

# Attempt 1: loop over each result container and pull out its title
for dd in page.find_all("div", {"class": "sentence-result-infos"}):
    title = dd.find(class_="title ellipsis ng-binding").text.strip()
    print(title)

# Attempt 2: grab the title spans directly
spans = page.find_all('span', {'class': 'title ellipsis ng-binding'})
for span in spans:
    print(span.text)

I'd like the output to be shown as the title followed by the link.

EDIT:

I have just tried the following, but I get an error saying "text" is not an attribute:

from bs4 import BeautifulSoup
from requests_html import HTMLSession
session = HTMLSession()
response = session.get('http://www.sky.com/tv/channel/skycinema/find-a-movie/search?genre=all&window=skyCinema&certificate=all&offset=0&scrollPosition=200')
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.find('span', {'class': 'title ellipsis ng-binding'}).text.strip()
print(title)
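
I suspect find() is returning None because the film list is built by JavaScript and isn't in the static HTML, so .text fails. If that's right, something along these lines might work with requests_html's own renderer (note that render() downloads Chromium the first time it runs):

from bs4 import BeautifulSoup
from requests_html import HTMLSession

session = HTMLSession()
response = session.get('http://www.sky.com/tv/channel/skycinema/find-a-movie/search?genre=all&window=skyCinema&certificate=all&offset=0&scrollPosition=200')

# Render the JavaScript first; without this the span isn't in the HTML,
# find() returns None and .text raises the error above
response.html.render()

soup = BeautifulSoup(response.html.html, 'html.parser')
span = soup.find('span', {'class': 'title ellipsis ng-binding'})
if span is not None:
    print(span.text.strip())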
Mr O
  • What do you mean by link? Please provide an example of a link for a title. – sentence Apr 28 '19 at 10:59
  • On the page, when you click on the poster it takes you to the film's page. The link is under "sentence-result-pod ng-isolate-scope", via its href. For example: – Mr O Apr 28 '19 at 11:04

2 Answers


There is an API to be found in the browser's network tab. You can get all results with one call by setting the limit to a number greater than the expected result count:

r = requests.get('http://www.sky.com/tv/api/search/movie?limit=10000&window=skyMovies').json()

Or use the number you can see on the page:

import requests
import pandas as pd

base = 'http://www.sky.com/tv'

# The search API returns every film as JSON; limit=1555 matches the count shown on the page
r = requests.get('http://www.sky.com/tv/api/search/movie?limit=1555&window=skyMovies').json()

# Pair each title with its full URL
data = [(item['title'], base + item['url']) for item in r['items']]
df = pd.DataFrame(data, columns=['Title', 'Link'])
print(df)
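
If a single large limit is ever refused, the offset parameter visible in the page URL suggests the same endpoint can be paged; the sketch below assumes it accepts offset alongside limit (untested) and simply stops when a page comes back short:

import requests

base = 'http://www.sky.com/tv'
api = 'http://www.sky.com/tv/api/search/movie'
page_size = 200
offset = 0
data = []

while True:
    # Assumes the API honours an offset parameter like the front-end URL does
    r = requests.get(api, params={'limit': page_size, 'offset': offset, 'window': 'skyMovies'}).json()
    items = r.get('items', [])
    data.extend((item['title'], base + item['url']) for item in items)
    if len(items) < page_size:
        break
    offset += page_size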
QHarr

First of all, read the terms and conditions of the site you are going to scrape.

Next, you need selenium:

from selenium import webdriver
import bs4
import time

# MODIFY the url with YOURS
url = "type the url to scrape here"

driver = webdriver.Firefox()
driver.get(url)

# The film list is rendered by JavaScript, so give the page a moment to load
time.sleep(5)

html = driver.page_source
driver.quit()

soup = bs4.BeautifulSoup(html, "html.parser")

baseurl = 'http://www.sky.com/'

titles = [n.text for n in soup.find_all('span', {'class': 'title ellipsis ng-binding'})]
links = [baseurl + h['href'] for h in soup.find_all('a', {'class': 'sentence-result-pod ng-isolate-scope'})]

for title, link in zip(titles, links):
    print(title, link)
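
If the two lists ever get out of step (for example a pod without a link), it may be safer to read the title and the href together from each result pod. A rough sketch, assuming the title span is nested inside each pod anchor (worth checking against the live markup):

for pod in soup.find_all('a', {'class': 'sentence-result-pod ng-isolate-scope'}):
    span = pod.find('span', {'class': 'title ellipsis ng-binding'})
    if span is not None and pod.get('href'):
        print(span.text.strip(), baseurl + pod['href'])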
sentence
  • Thank you for this. I've had to change Firefox to Chrome. When I run the above and add the URL in (from original post), I get a blank Chrome page with no data. Thank you again. – Mr O Apr 28 '19 at 11:26
  • Read [this post](https://stackoverflow.com/questions/22130109/cant-use-chrome-driver-for-selenium) to use chrome driver for `selenium`. – sentence Apr 28 '19 at 11:31
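
Following on from the comments, a Chrome version of the same idea with an explicit wait instead of reading the page source immediately (the chromedriver path below is only a placeholder) might look like this:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import bs4

url = "type the url to scrape here"

# Selenium 3 style; point the path at your own chromedriver download
driver = webdriver.Chrome(executable_path='/path/to/chromedriver')
driver.get(url)

# Wait until at least one film title has been rendered by the page's JavaScript
WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'span.title.ellipsis'))
)

soup = bs4.BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

print(soup.find('span', {'class': 'title ellipsis ng-binding'}).text.strip())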