
I'm a novice programmer, so apologies in advance if what I'm writing is badly worded or just plain stupid.

I'm trying to scrape info from a website and store the results in a database. The goal is to get all the train numbers and stations, and to see whether each train is late or not. The way I started is with a loop that builds up this URL, substituting each letter of the alphabet for $LETTER, one at a time: https://reservia.viarail.ca/GetStations.aspx?q=$LETTER

I then parse the results and store everything correctly in a database. This script doesn't take long to run, so that's no issue. The issue comes when I try to get all the trains that pass through each station. To do this, I go through every station stored previously (580 of them) and request this URL, replacing $DATE with today's date in YYYY-MM-DD format and $CODE with the station code:

reservia.viarail.ca/tsi/GetTrainList.aspx?OriginStationCode=$CODE&Date=$DATE

So, for example, I would have this link for Montreal,

and I would go through each element of the table, read the train number, and then insert it into a table. That was my plan so far, but the script takes way too much time to run (over 7 minutes), which makes sense since we're opening 580 pages one after another.
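In rough code, the loop looks like this (a minimal sketch using the requests library for brevity; the two station codes below are just placeholders, since the real ones come from my database):

import datetime

import requests

today = datetime.date.today().isoformat()  # YYYY-MM-DD

# in practice these 580 codes are read from the stations table
station_codes = ['MTRL', 'TRTO']

for code in station_codes:
    url = ('https://reservia.viarail.ca/tsi/GetTrainList.aspx'
           '?OriginStationCode=' + code + '&Date=' + today)
    response = requests.get(url)  # one blocking request per station
    # parse the returned table and insert each train number

Each requests.get call blocks until the page comes back, so the 580 round trips run back to back.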

What's a better way of doing this? I'm using Python since I'm trying to learn it, so I've been importing the urllib library and using it to fetch and decode each page, and then I sort through the data. Thanks for any suggestions/help!

  • You should look into making [asynchronous requests](https://hackernoon.com/how-to-run-asynchronous-web-requests-in-parallel-with-python-3-5-without-aiohttp-264dc0f8546?gi=a0a59d394aed) if you aren't going to run into rate limits. – Jack Moody Mar 04 '19 at 15:48
  • OK then I would technically be sending all 580 requests at the same time? Can I still do this even though the URL is built in a for loop since I'm getting the code of each station from a database table? I would probably have to change the structure of the code, right? – Michel Georges Najarian Mar 04 '19 at 15:55
  • Normally you don't want to send all 580 requests at the same time. I would suggest sending 5-10 requests at a time so that you don't overload the website you are scraping. As long as your results later in the loop don't rely on previous results, this will work. So yes, you will have to change some of your code (see the sketch after these comments). – Jack Moody Mar 04 '19 at 15:59
  • Possible duplicate of [What is the fastest way to send 100,000 HTTP requests in Python?](https://stackoverflow.com/questions/2632520/what-is-the-fastest-way-to-send-100-000-http-requests-in-python) – Jack Moody Mar 04 '19 at 15:59
  • Thanks Jack, will take a look! And none of the results depend on the previous ones so that's fine. – Michel Georges Najarian Mar 04 '19 at 16:02
  • Or, since you're a novice anyway, maybe switch to an asynchronous language like js / go – pguardiario Mar 05 '19 at 00:40
  • @pguardiario I'm working on different projects with different languages to learn new stuff with each one of them. I'm good enough with JavaScript, and I'm using Go for another project where I'm scraping info from different websites as well, but I wanted to use Python for this one for no real reason. I do already have a few working scripts and I want to stay uniform, continuing to learn Python. So far I'm adept (I would even say intermediate level) in C++. Any good resources to learn more about HTTP requests, XML, XHR requests, etc.? – Michel Georges Najarian Mar 05 '19 at 19:45
  • This isn't the right place for this, but there are a lot of good reasons to switch to JS for projects like this. – pguardiario Mar 05 '19 at 22:42
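Following the comment suggestion about sending only a handful of requests at a time, here is a minimal sketch using the standard library's concurrent.futures with a capped worker pool. The URL pattern is the one from the question; fetch_trains is a hypothetical helper, and the empty station_codes list stands in for the 580 codes from the database:

import datetime
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

BASE = 'https://reservia.viarail.ca/tsi/GetTrainList.aspx'
today = datetime.date.today().isoformat()

def fetch_trains(code):
    # one request per station; parsing happens after the fetch
    response = requests.get(BASE, params={'OriginStationCode': code, 'Date': today})
    return code, response.text

station_codes = []  # fill with the 580 codes from the stations table

# max_workers caps how many requests are in flight at once,
# so the site never sees more than 10 concurrent connections
with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(fetch_trains, code) for code in station_codes]
    for future in as_completed(futures):
        code, page = future.result()
        # parse `page` and insert the train numbers for `code`

Since none of the results depend on each other, the order in which as_completed yields them doesn't matter.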

1 Answer


I like questions like this! Ok, the code below should do almost exactly what you want.

import requests
import pandas as pd
from string import ascii_lowercase

alldata = []
for c in ascii_lowercase:
    # one request per letter; GetStations.aspx answers with JSON
    response = requests.get('https://reservia.viarail.ca/GetStations.aspx?q=' + c)
    # 'sc', 'sn', 'pv' look like station code, station name, province
    df = pd.DataFrame(response.json(), columns=['sc', 'sn', 'pv'])
    alldata.append(df)

Now, just load that list into your database. Done.
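If the loading step is the open question, one possibility is pandas' built-in to_sql (a sketch assuming SQLite; the file and table names here are made up):

import sqlite3

import pandas as pd

# combine the 26 per-letter frames into one
stations = pd.concat(alldata, ignore_index=True)

with sqlite3.connect('viarail.db') as conn:
    stations.to_sql('stations', conn, if_exists='replace', index=False)

if_exists='replace' recreates the table on every run, which is convenient while you're still experimenting.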

ASH