
I want to scrape the bus schedule times from the following website: https://ul.se . It's in Swedish, but there is an English option, and in any case that doesn't affect my main question. By entering the locations we are interested in into the search fields, we arrive at the following link, which is an example of the ones I am interested in: https://www.ul.se/#/700600/0/Uppsala%20Centralstation%20(Uppsala)/700591/0/Stadsbiblioteket%20(Uppsala)//2/

By right-clicking and inspecting the elements I want to scrape, I can see them in dev tools, but they are nowhere to be found in the full page source. For example, copying the selector for the first available time of the next bus gives me

#travelResults > ul-search-journey-result-card:nth-child(3) > div > div.result-head > div.result-row.time > span.dpt-arr-time.ng-binding

This does not work at all in my code. Even if I set aside BeautifulSoup (which is what I am trying to use) and just search the full HTML for any numbers indicating a time, there is still nothing. In the entire source there appear to be no numbers that indicate a time. What am I missing? How could this problem be tackled? Any help is appreciated; I am very new to all of this.
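Roughly what I am trying (a minimal sketch of my attempt, fetching the page with requests and using the selector copied above):

import requests
from bs4 import BeautifulSoup

url = ("https://www.ul.se/#/700600/0/Uppsala%20Centralstation%20(Uppsala)"
       "/700591/0/Stadsbiblioteket%20(Uppsala)//2/")
soup = BeautifulSoup(requests.get(url).text, "html.parser")

selector = ("#travelResults > ul-search-journey-result-card:nth-child(3) > div >"
            " div.result-head > div.result-row.time > span.dpt-arr-time.ng-binding")
print(soup.select_one(selector))  # prints None: the element is not in the raw HTML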

Esoog
  • Such data can often be fetched via an API; open the browser console (F12), go to the Network tab, and look there. – crayxt Jun 27 '21 at 10:39
  • Does this answer your question? [How to get missing HTML data when web scraping with python-requests](https://stackoverflow.com/questions/58249868/how-to-get-missing-html-data-when-web-scraping-with-python-requests) – Jim G. Jun 27 '21 at 10:41

3 Answers

HTTP GET https://www.ul.se/api/journey/search?changeTimeType=0&dateTime=&from=Uppsala+Centralstation+(Uppsala)&fromPointId=700600&fromPointType=0&maxWalkDistance=3000&priorityType=0&to=Stadsbiblioteket+(Uppsala)&toPointId=700591&toPointType=0&trafficTypes=1,2,3,4,5,6,7,8,9,10,11&travelWhenType=2&via=&viaPointId=&walkSpeedType=0

returns the data you are looking for. (In the browser: F12 > Network > XHR.)
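For a quick sanity check that the endpoint responds (a minimal sketch; the query string is the one above):

import requests

url = ("https://www.ul.se/api/journey/search?changeTimeType=0&dateTime="
       "&from=Uppsala+Centralstation+(Uppsala)&fromPointId=700600&fromPointType=0"
       "&maxWalkDistance=3000&priorityType=0&to=Stadsbiblioteket+(Uppsala)"
       "&toPointId=700591&toPointType=0&trafficTypes=1,2,3,4,5,6,7,8,9,10,11"
       "&travelWhenType=2&via=&viaPointId=&walkSpeedType=0")

resp = requests.get(url)
resp.raise_for_status()
# the journey list is a JSON string stored under the "Payload" key
print(resp.json()["Payload"][:200])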

balderman

And if you parse the complex JSON that comes back, you can find many columns:

import json

import pandas as pd
import requests

url = """https://www.ul.se/api/journey/search?changeTimeType=0&dateTime=&from=Uppsala+Centralstation+(Uppsala)&fromPointId=700600&fromPointType=0&maxWalkDistance=3000&priorityType=0&to=Stadsbiblioteket+(Uppsala)&toPointId=700591&toPointType=0&trafficTypes=1,2,3,4,5,6,7,8,9,10,11&travelWhenType=2&via=&viaPointId=&walkSpeedType=0"""

a = requests.get(url)
# "Payload" is itself a JSON-encoded string, hence the double json.loads
print(pd.json_normalize(json.loads(json.loads(a.text)["Payload"])).columns)

This prints:
Index(['journeyKey', 'departureDateTime', 'departureIsTimingPoint',
       'hasRealTimeDepartureDeviation', 'realTimeDepartureDateTime',
       'arrivalDateTime', 'arrivalIsTimingPoint',
       'hasRealTimeArrivalDeviation', 'realTimeArrivalDateTime', 'travelTime',
       'routeLinks', 'priceZoneList', 'noOfChanges', 'zones', 'from.id',
       'from.name', 'from.area', 'from.type', 'from.coordinate.latitude',
       'from.coordinate.longitude', 'to.id', 'to.name', 'to.area', 'to.type',
       'to.coordinate.latitude', 'to.coordinate.longitude', 'ticketType.code',
       'ticketType.category', 'ticketType.name.parts',
       'ticketType.name.additions', 'ticketType.zones',
       'ticketType.validSeconds', 'ticketType.priceClasses',
       'ticketType.expires', 'ticketType.additionalTicketTypes'],
      dtype='object')
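From there, reading the times is straightforward (a sketch continuing from the snippet above, reusing its url and picking columns from the listing):

a = requests.get(url)
df = pd.json_normalize(json.loads(json.loads(a.text)["Payload"]))
# journey-level departure/arrival columns from the listing above
print(df[["departureDateTime", "arrivalDateTime", "travelTime"]])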
crayxt

This example shows how to parse the routes from the JSON data using the json module:

import json
import requests


# url = "https://www.ul.se/#/700600/0/Uppsala%20Centralstation%20(Uppsala)/700591/0/Stadsbiblioteket%20(Uppsala)//2/"

api_url = "https://www.ul.se/api/journey/search"

params = {
    "changeTimeType": "0",
    "dateTime": "",
    "from": "Uppsala Centralstation (Uppsala)",  # <-- 1.
    "fromPointId": "700600",  # <-- 2.
    "fromPointType": "0",
    "maxWalkDistance": "3000",
    "priorityType": "0",
    "to": "Stadsbiblioteket (Uppsala)",  # <-- 3.
    "toPointId": "700591",  # <-- 4.
    "toPointType": "0",
    "trafficTypes": "1,2,3,4,5,6,7,8,9,10,11",
    "travelWhenType": "2",
    "via": "",
    "viaPointId": "",
    "walkSpeedType": "0",
}

data = requests.get(api_url, params=params).json()
data = json.loads(data["Payload"])

# uncomment to print all data:
# print(json.dumps(data, indent=4))

for journey in data:
    for route in journey["routeLinks"]:
        print(
            "{:<40} {:<30} {:<20} {:<20}".format(
                route["from"]["name"],
                route["to"]["name"],
                route["arrivalDateTime"],
                route["departureDateTime"],
            )
        )
    print()

Prints:

Uppsala Centralstation (Uppsala)         Klostergatan (Uppsala)         2021-06-27T12:56:00  2021-06-27T12:55:00 
Klostergatan (Uppsala)                   Stadsbiblioteket (Uppsala)     2021-06-27T13:01:00  2021-06-27T12:56:00 

Uppsala Centralstation (Uppsala)         Klostergatan (Uppsala)         2021-06-27T13:02:00  2021-06-27T13:00:00 
Klostergatan (Uppsala)                   Stadsbiblioteket (Uppsala)     2021-06-27T13:07:00  2021-06-27T13:02:00 

Uppsala Centralstation (Uppsala)         Klostergatan (Uppsala)         2021-06-27T13:12:00  2021-06-27T13:10:00 
Klostergatan (Uppsala)                   Stadsbiblioteket (Uppsala)     2021-06-27T13:17:00  2021-06-27T13:12:00 

Uppsala Centralstation (Uppsala)         Stadsbiblioteket (Uppsala)     2021-06-27T13:20:00  2021-06-27T13:15:00 

Uppsala Centralstation (Uppsala)         Klostergatan (Uppsala)         2021-06-27T13:20:00  2021-06-27T13:18:00 
Klostergatan (Uppsala)                   Stadsbiblioteket (Uppsala)     2021-06-27T13:25:00  2021-06-27T13:20:00 

Uppsala Centralstation (Uppsala)         Klostergatan (Uppsala)         2021-06-27T13:21:00  2021-06-27T13:20:00 
Klostergatan (Uppsala)                   Stadsbiblioteket (Uppsala)     2021-06-27T13:26:00  2021-06-27T13:21:00 
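If you only need the overall times for each journey rather than every leg, the top-level fields can be read directly (a short variant of the loop above, using the journey-level departureDateTime, arrivalDateTime, and noOfChanges fields that appear in the column listing in the previous answer):

for journey in data:
    print(journey["departureDateTime"], "->",
          journey["arrivalDateTime"], "| changes:", journey["noOfChanges"])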
Andrej Kesely
  • Thank you very much for an answer involving code! It helps newbies like me a lot. This looks like magic to me at the moment, but I am eager to learn. Where can I find information on how to do this myself? For example, how did you know which api_url to use, to pass the parameters in the requests.get call, etc.? Is there a good series of tutorials on web scraping you would recommend? – Esoog Jun 27 '21 at 16:39
  • 1
    @Esoog When you open Firefox Developer Tools -> Network Tab (Chrome has something similar too), you will see all requests the page is doing. When the data isn't inside the page, the page must load them somewhere... – Andrej Kesely Jun 27 '21 at 16:50