
I've written a Python script with Selenium to scrape the complete flight schedule from a webpage. The script mostly works, but some fields are not being parsed. I inspected the elements that hold the data, and the elements for the fields that are scraped look no different from those for the missing ones. What should I do to get the full content? Thanks in advance.

Here is the script I'm trying with:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://www.yvr.ca/en/passengers/flights/departing-flights")
wait = WebDriverWait(driver, 10)

item = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "table.yvr-flights__table")))
list_of_data = [[item.text for item in data.find_elements_by_css_selector('td')]
                    for data in item.find_elements_by_css_selector('tr')]
for tab_data in list_of_data:
    print(tab_data)

driver.quit()

Here is the partial picture of the data [missing one and scraped one]: https://www.dropbox.com/s/xaqeiq97b6upj5j/flight_stuff.jpg?dl=0

Here are the td elements for one block:

<tr class="yvr-flights__row  yvr-flights__row--departed " id="226792377">
            <td>
                <time class="yvr-flights__label yvr-flights__scheduled-label yvr-flights__scheduled-label--departed notranslate" datetime="2017-08-24T06:20:00-07:00">
                    06:20
                </time>
            </td>
            <td class="yvr-flights__table-cell--revised notranslate">
                        <time class="yvr-flights__label yvr-flights__revised-label yvr-flights__revised-label--departed" datetime="2017-08-24T06:20:00-07:00">
                            06:19
                        </time>
            </td>
            <td class="yvr-table__cell yvr-flights__flightNumber notranslate">AC560</td>
            <td class="hidden-until--md yvr-table__cell yvr-table__cell--fade-out yvr-table__cell--nowrap notranslate">Air Canada</td>
            <td class="yvr-table__cell yvr-table__cell--fade-out yvr-table__cell--nowrap notranslate">San Francisco</td>
            <td class="hidden-until--md yvr-table__cell yvr-table__cell--nowrap notranslate">
Main                
            </td>
            <td class="hidden-until--md yvr-table__cell yvr-table__cell--nowrap notranslate">E87</td>

            <td class="yvr-flights__table-cell--status yvr-table__cell--nowrap">
                    <span class="yvr-flights__status yvr-flights__status--departed">Departed</span>
            </td>
            <td class="hidden-until--md yvr-table__cell yvr-table__cell--nowrap">
            </td>
            <td class="visible-until--md">
                <button class="yvr-flights__toggle-flight">Toggle flight</button>
            </td>
        </tr>
SIM

3 Answers


You can just open this URL to get all the details:

http://www.yvr.ca/en/_api/Flights?%24filter=FlightScheduledTime%20gt%20DateTime%272017-08-24T00%3A00%3A00%27%20and%20FlightScheduledTime%20lt%20DateTime%272017-08-25T00%3A00%3A00%27%20and%20FlightType%20eq%20%27D%27&%24orderby=FlightScheduledTime%20asc

Decoded, the URL reads:

http://www.yvr.ca/en/_api/Flights?$filter=FlightScheduledTime gt DateTime'2017-08-24T00:00:00' and FlightScheduledTime lt DateTime'2017-08-25T00:00:00' and FlightType eq 'D'&$orderby=FlightScheduledTime asc

So you should just parameterize this, replacing the dates based on the current date, to get all the data in JSON form:

{
odata.metadata: "http://www.yvr.ca/_api/$metadata#Flights",
value: [
{
FlightStatus: "Departed",
FlightRemarksAdjusted: "Departed",
FlightScheduledTime: "2017-08-24T06:15:00",
FlightEstimatedTime: "2017-08-24T06:10:00",
FlightNumber: "WS560",
FlightAirlineName: "WestJet",
FlightAircraftType: "73H",
FlightDeskTo: "",
FlightDeskFrom: "",
FlightCarousel: "",
FlightRange: "D",
FlightCarrier: "WS",
FlightCity: "Calgary",
FlightType: "D",
FlightAirportCode: "YYC",
FlightGate: "B14",
FlightRemarks: "Departed",
FlightID: 226790614,
FlightQuickConnect: ""
},
{
FlightStatus: "Departed",
FlightRemarksAdjusted: "Departed",
FlightScheduledTime: "2017-08-24T06:20:00",
FlightEstimatedTime: "2017-08-24T06:19:00",
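The parameterization described above could look roughly like the following. This is a minimal sketch, not a tested client: the helper names `build_departures_url` and `summarize` are made up for illustration, the endpoint and field names are copied from the URL and JSON shown above, and the sketch does not check whether the endpoint still responds this way.

```python
from datetime import date, timedelta
from urllib.parse import quote, urlencode

# Endpoint copied from the URL above; whether it still responds is not checked here.
API_URL = "http://www.yvr.ca/en/_api/Flights"


def build_departures_url(day):
    """Build the departures query for one calendar day (hypothetical helper)."""
    start = day.strftime("%Y-%m-%dT00:00:00")
    end = (day + timedelta(days=1)).strftime("%Y-%m-%dT00:00:00")
    params = {
        "$filter": (
            f"FlightScheduledTime gt DateTime'{start}' and "
            f"FlightScheduledTime lt DateTime'{end}' and "
            "FlightType eq 'D'"
        ),
        "$orderby": "FlightScheduledTime asc",
    }
    # quote_via=quote encodes spaces as %20, matching the encoded URL above.
    return API_URL + "?" + urlencode(params, quote_via=quote)


def summarize(payload):
    """Pick a few fields from each flight in a decoded JSON payload (hypothetical helper)."""
    return [
        (
            flight["FlightScheduledTime"],
            flight["FlightNumber"],
            flight["FlightCity"],
            flight["FlightGate"],
            flight["FlightStatus"],
        )
        for flight in payload.get("value", [])
    ]


if __name__ == "__main__":
    print(build_departures_url(date.today()))
```

The response body could then be fetched with any HTTP client, decoded with `json.loads`, and passed to `summarize`.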
Tarun Lalwani
  • Thanks, Tarun Lalwani, for your finding. Basically, the data is not what I'm after; I would like to learn how to fix the mistakes I've made using Selenium in the script pasted above. Thanks again. – SIM Aug 24 '17 at 17:26
  • Add a sleep after `driver.get("http://www.yvr.ca/en/passengers/flights/departing-flights")` and see if that helps. Maybe you are getting the element before it has finished loading. – Tarun Lalwani Aug 24 '17 at 17:30
  • In fact, I tried a hard-coded delay, but the results were even worse. – SIM Aug 24 '17 at 17:55

Since you are looking to fix your script rather than just get the data, here are a few issues I found in it.

First, you are scanning all tr nodes, but the rows you are interested in have the yvr-flights__row class. Some of those rows are hidden and hold no data; they carry yvr-flights__row--hidden, so you want to exclude them.

Also, the second column of the table doesn't always have data. When it does, it looks like this:

<td class="yvr-flights__table-cell--revised notranslate">
                        <time class="yvr-flights__label yvr-flights__revised-label yvr-flights__revised-label--early" datetime="2017-08-25T06:30:00-07:00">
                            06:20
                        </time>
            </td>

So when you call .text on such a td, it can come back empty even though the nested time node holds the value, because Selenium's .text returns only text that is rendered as visible. There are multiple ways to fix that, but I use JS to get the content of such a node:

driver.execute_script("return arguments[0].textContent.trim();", element)  # element is the td WebElement

Combining all of this, the script below does all the work:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://www.yvr.ca/en/passengers/flights/departing-flights")
wait = WebDriverWait(driver, 10)

table = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "table.yvr-flights__table")))

# Only rows that carry flight data; skip the hidden placeholder rows.
rows = table.find_elements(By.CSS_SELECTOR, "tr.yvr-flights__row:not(.yvr-flights__row--hidden)")

list_of_data = [
    [
        # Fall back to textContent when .text comes back empty.
        cell.text or driver.execute_script("return arguments[0].textContent.trim();", cell).strip()
        for cell in row.find_elements(By.CSS_SELECTOR, "td")
    ]
    for row in rows
]

for tab_data in list_of_data:
    print(tab_data)

driver.quit()

It gives me the following output:

['02:00', '02:20', 'CX889', 'Cathay Pacific', 'Hong Kong', 'Main', 'D64', 'Departed', '', 'Toggle flight']
['05:15', '', 'PR127', 'Philippine Airlines', 'Manila', 'Main', 'D70', 'Departed', '', 'Toggle flight']
['06:00', '', 'AS964', 'Alaska Airlines', 'Seattle', 'Main', 'E73', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:00', '', 'DL4805', 'Delta Air Lines', 'Seattle', 'Main', 'E90', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:00', '', 'WS3114', 'WestJet', 'Kelowna', 'Main', 'A9', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:00', '', 'AA6045', 'American Airlines', 'Los Angeles', 'Main', 'E86', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:00', '', 'AC100', 'Air Canada', 'Toronto', 'Main', 'C45', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:01', '', 'UA618', 'United Airlines', 'San Francisco', 'Main', 'E76', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:10', '', 'AC8606', 'Air Canada', 'Winnipeg', 'Main', 'C39', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:10', '', 'AC8190', 'Air Canada', 'Kamloops', 'Main', 'C34', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:10', '', 'AC200', 'Air Canada', 'Calgary', 'Main', 'C29', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:15', '', 'WS560', 'WestJet', 'Calgary', 'Main', 'B13', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:20', '', 'AC560', 'Air Canada', 'San Francisco', 'Main', 'E87', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:30', '06:20', 'DL2555', 'Delta Air Lines', 'Minneapolis', 'Main', 'E88', 'Early', 'NOTIFY ME', 'Toggle flight']
['06:30', '', 'WS700', 'WestJet', 'Toronto', 'Main', 'B15', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:30', '', 'UA664', 'United Airlines', 'Chicago', 'Main', 'E75', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:40', '', 'AM695', 'AeroMexico', 'Mexico City', 'Main', 'D53', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:40', '', 'WS6110', 'WestJet', 'Mexico City', 'Main', 'D53', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:45', '06:45', 'AC8055', 'Air Canada', 'Victoria', 'Main', '', 
...
['23:25', '', 'AC8269', 'Air Canada', 'Nanaimo', 'Main', '', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:25', '', 'AM697', 'AeroMexico', 'Mexico City', 'Main', 'D54', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:25', '', 'WS6108', 'WestJet', 'Mexico City', 'Main', 'D54', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:25', '', 'AC8083', 'Air Canada', 'Victoria', 'Main', 'C38', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:25', '', 'AC308', 'Air Canada', 'Montreal', 'Main', 'C29', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:26', '', 'WS564', 'WestJet', 'Montreal', 'Main', 'B13', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:30', '', 'AC128', 'Air Canada', 'Toronto', 'Main', 'C47', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:40', '', 'AC33', 'Air Canada', 'Sydney', 'Main', 'D52', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:45', '', 'AC35', 'Air Canada', 'Brisbane', 'Main', 'D65', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:45', '', 'AC344', 'Air Canada', 'Ottawa', 'Main', 'C49', 'On Time', 'NOTIFY ME', 'Toggle flight']
Tarun Lalwani
  • It doesn't solve the issue but makes the result look better. Thanks Tarun for your answer. You always come up with something new. This javascript command will help me in future. Let me put some weight on your reputation score. Btw, I have already used your provided api as well and that was very easy to deal with. Thanks for everything. – SIM Aug 25 '17 at 13:56
  • @Topto, what part of the issue is not solved? Let me know and I will tell you what needs to be done – Tarun Lalwani Aug 25 '17 at 15:52

As suggested by Tarun Lalwani, WebDriver is really the wrong tool for this activity.

The problem is that WebDriver only returns text from elements that are visible on the screen, so if you want the data from all the rows you will need to scroll down through the rows and collect the data one row at a time, as discussed in "WebElement getText() is an empty string in Firefox if element is not physically visible on the screen". This will be painfully slow.

I guess you could also grab the textContent instead of item.text. In Java:

item.getAttribute("textContent");

I'm sure Python has an equivalent.
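In Selenium's Python bindings the corresponding call is `get_attribute("textContent")` on a WebElement. The sketch below wraps it in a fallback helper. The names `cell_text` and `StubCell` are invented for illustration, and the stub only stands in for a real element so that the snippet runs without a browser.

```python
def cell_text(element):
    """Return element.text, falling back to the raw textContent when it is empty."""
    visible = element.text
    if visible:
        return visible
    raw = element.get_attribute("textContent")
    return (raw or "").strip()


class StubCell:
    """Hypothetical stand-in for a WebElement, so the sketch runs without a browser."""

    def __init__(self, text, text_content):
        self.text = text
        self._text_content = text_content

    def get_attribute(self, name):
        # Only the one attribute used by cell_text is modelled here.
        return self._text_content if name == "textContent" else None


if __name__ == "__main__":
    # A cell whose visible text is empty but whose markup still holds a time.
    print(cell_text(StubCell("", "\n      06:19\n  ")))
```

With a real driver, the same helper would be called on each td returned by `find_elements`.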

jsoup would be an alternative way to grab the data in a single shot, and much faster.