
I am trying to get the price data from the following URL. However, I can only seem to get the text from the `div`s down to a certain level. Here is my code:

from selenium import webdriver
from bs4 import BeautifulSoup

def scrape_flight_prices(URL):

    browser = webdriver.PhantomJS()
    # PARSE THE HTML
    browser.get(URL)
    soup = BeautifulSoup(browser.page_source, "lxml")
    page_divs = soup.findAll("div", attrs={'id':'app-root'}) 
    for p in page_divs:
        print(p)

if __name__ == '__main__':
  URL1="https://www.skyscanner.net/transport/flights/brs/gnb/190216/190223/?adults=1&children=0&adultsv2=1&childrenv2=&infants=0&cabinclass=economy&rtn=1&preferdirects=false&outboundaltsenabled=false&inboundaltsenabled=false&ref=home#results"
  scrape_flight_prices(URL1)

And here is the output:

<div id="app-root">
<section class="day-content state-loading state-no-results" id="daysection">
<div class="day-searching">
<div class="hot-spinner medium"></div>
<div class="day-searching-message">Searching</div>
</div>
</section>
</div>

The section of HTML I want to scrape from is on this results page:

https://www.skyscanner.net/transport/flights/brs/gnb/190216/190223/?adults=1&children=0&adultsv2=1&childrenv2=&infants=0&cabinclass=economy&rtn=1&preferdirects=false&outboundaltsenabled=false&inboundaltsenabled=false&ref=home#results

However when I try and scrape with the following code:

prices = soup.findAll("a", attrs={'target':"_blank", "data-e2e":"itinerary-price", "class":"CTASection__price-2bc7h price"})  
for p in prices:
    print(p)

It prints nothing! I suspect a JS script is generating the rest of the markup and/or data. Can anyone help me extract it? Specifically I am trying to get the price, flight times, airline name etc., but if BeautifulSoup isn't seeing the relevant HTML from the page then I'm not sure how else to get it.

Would appreciate any pointers! Many thanks in advance!
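To illustrate the problem, here is a minimal sketch (using the loading-state snapshot printed above as a literal string) showing that the price selector matches nothing in the initial page source, because the itinerary markup is injected later by JavaScript:

```python
from bs4 import BeautifulSoup

# The initial page source Selenium hands back before the JS has finished,
# copied from the output shown in the question.
initial_html = """
<div id="app-root">
<section class="day-content state-loading state-no-results" id="daysection">
<div class="day-searching">
<div class="hot-spinner medium"></div>
<div class="day-searching-message">Searching</div>
</div>
</section>
</div>
"""

soup = BeautifulSoup(initial_html, "html.parser")

# The selector from the question finds no anchors in this snapshot.
prices = soup.findAll("a", attrs={"data-e2e": "itinerary-price"})
print(prices)  # []
```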

user3062260

1 Answer


Try the code below to get the prices:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the dynamically rendered price elements to appear
prices = [price.text
          for price in wait(browser, 10).until(
              EC.presence_of_all_elements_located((By.CLASS_NAME, "price")))]
print(prices)
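Once the wait succeeds, `browser.page_source` contains the rendered markup, so the question's BeautifulSoup selectors will work on it. A sketch against a hypothetical rendered snippet (the hashed class name like `CTASection__price-2bc7h` is taken from the question and may change between site builds):

```python
from bs4 import BeautifulSoup

# Hypothetical rendered markup, loosely modelled on the selectors in the
# question; the real Skyscanner markup may differ.
rendered_html = """
<div id="app-root">
<a target="_blank" data-e2e="itinerary-price" class="CTASection__price-2bc7h price">£123</a>
<a target="_blank" data-e2e="itinerary-price" class="CTASection__price-2bc7h price">£145</a>
</div>
"""

soup = BeautifulSoup(rendered_html, "html.parser")

# Matching on the data-e2e attribute is more robust than the hashed class name.
prices = [a.get_text() for a in
          soup.findAll("a", attrs={"data-e2e": "itinerary-price"})]
print(prices)  # ['£123', '£145']
```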
Andersson
  • Thanks for your help, I can't get this to work though, 'driver' isn't defined. I tried swapping it for 'webdriver.PhantomJS()' and with 'soup' but neither worked. I'm not all that sure what I'm doing with it to be fair? – user3062260 Oct 27 '18 at 21:29
  • @user3062260 , oh... right. replace `driver` with `browser` – Andersson Oct 27 '18 at 21:31
  • 1
    it raises a 'TimeoutException' – user3062260 Oct 27 '18 at 21:40
  • @user3062260 , are you sure that target web-page is opened instead of *Person/Robot verification*? – Andersson Oct 27 '18 at 21:45
  • It seems to return 'div's when I print them in a loop but not any tags that are buried a bit deeper in the html, it seems to be the dynamic content on the site like prices – user3062260 Oct 27 '18 at 22:16
  • soup.findAll("div", attrs={'id':'app-root'})[0].find("section", attrs={'id':'day-section'}) should return a div of class "day-cols clearfix" which contains the data but this div doesn't get retrieved – user3062260 Oct 27 '18 at 22:20
  • @user3062260 , I've tried it in Chrome and it works fine. But sometimes *Person/Robot verification* page opens and, yeah, it of course fails – Andersson Oct 28 '18 at 07:45
  • I've tried changing "browser = webdriver.PhantomJS()" to "browser = webdriver.Chrome()" but it says that chromedriver needs to be in the path. Is that an extra library that I would need to install? What version of python are you using? – user3062260 Oct 28 '18 at 09:08
  • @user3062260 , just [download last Chromedriver version](https://chromedriver.storage.googleapis.com/index.html?path=2.43/) and put it in the same folder where PhantomJS/Python executable located or specify the path to chromedriver explicitly `driver = webdriver.Chrome('/path/to/chromedriver') ` – Andersson Oct 28 '18 at 09:14
  • Awesome!! That works now - I just didn't realise you had to explicitly tell webdriver.Chrome where the executable is! Thanks so much for all your help! One last question - it opens an actual browser, is there a way to get it to run silently in the background? I was going to put this in a loop and scrape quite a few prices periodically. Unless I can send the same browser a loop of URLs? – user3062260 Oct 28 '18 at 11:32
  • @user3062260 , you can check [how to use headless Chrome](https://stackoverflow.com/questions/46920243/how-to-configure-chromedriver-to-initiate-chrome-browser-in-headless-mode-throug). Also there is no need to open new browser instance for getting each page. You can define list of URLs (`url_list = ['URL1', 'URL2', ...'URLn']`) and iterate through it `for url in url_list: driver.get(url)` – Andersson Oct 28 '18 at 11:49