I am doing a web crawler project which is supposed to take two dates as input (like 2019-03-01 and 2019-03-05) and then append every day between those two dates to the end of a base link (for example, base link + date is https://www.wunderground.com/history/daily/ir/mashhad/OIMM/date/2019-1-3). I want to extract the table with the class name "tablesaw-sortable" from the page source and save it to a text file or any similar file format.

I developed this code:

from datetime import timedelta, date
from bs4 import BeautifulSoup
import urllib.request
from selenium import webdriver

class webcrawler():
    def __init__(self, st_date, end_date):
        self.base_url = 'https://www.wunderground.com/history/daily/ir/mashhad/OIMM/date/'
        self.st_date = st_date
        self.end_date = end_date

    def date_list(self):
        return [str(date1 + timedelta(n)) for n in range(int ((self.end_date - self.st_date).days)+1)]

    def create_link(self, attachment):
        url = str(self.base_url) 
        url += attachment
        return url

    def open_link(self, link):
        driver = webdriver.PhantomJS()
        driver.get(link)
        html = driver.page_source
        return html

    def extract_table(self, html):
        soup = BeautifulSoup(html)
        print(soup.prettify())

    def output_to_csv(self):
        pass

date1 = date(2018, 3, 1)
date2 = date(2019, 3, 5)

test = webcrawler(st_date=date1, end_date=date2)
date_list = test.date_list()
link = test.create_link(date_list[0])
html = test.open_link(link)
test.extract_table(html)

The problem is that it takes too long to get the page_source of just one link. I already used urllib.request, but the problem with that method is that it sometimes grabs the HTML before the table has fully loaded.

How can I speed up the process, extract only the mentioned table and its HTML source, and avoid waiting for the rest of the page? I just want the information in the table rows to be saved in some text file for each date.

Can anybody help me deal with this problem?

Masoud Masoumi Moghadam
  • Have you considered **not** using Selenium? If you only care about the HTML and you do not care to render the page, why use a rendering engine?!?! I am sure Python has some library for retrieving a page from a URL. – SiKing Apr 03 '19 at 23:22
  • @SiKing I used the `requests` library method but the problem is that it gets the page source before the element `table` with class name `tablesaw-sortable` is loaded. I just want to retrieve that table and stop operations on the web page after that. – Masoud Masoumi Moghadam Apr 04 '19 at 17:16

1 Answer

There are quite a few notable things wrong with this code and with how you are using the libraries. Let me try to fix it up.

First, I don't see you actually using the urllib.request library anywhere. You can remove that import, or if you are using it in another spot in your code, I recommend the widely praised requests module. I also recommend using the requests library instead of selenium if you are only trying to get the HTML source from a site, as selenium is designed more for navigating sites and acting like a 'real' person.

You can use response = requests.get('https://your.url.here') and then response.text to get the returned HTML.
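
For example, a minimal sketch (the URL and the timeout value here are just placeholders):

import requests

# Fail loudly on HTTP errors instead of silently parsing an error page.
response = requests.get('https://your.url.here', timeout=10)
response.raise_for_status()

html = response.text  # the returned HTML as a string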

Next I noticed in the open_link() method, you are creating a new instance of the PhantomJS class each time you call the method. This is really inefficient as selenium uses a lot of resources (and takes a long time, depending on the driver you are using). This may be a big contributor to your code running slower than desired. You should reuse the driver instance as much as possible, as selenium was designed to be used that way. A great solution to this would be creating the driver instance in the webcrawler.__init__() method.

class WebCrawler():
    def __init__(self, st_date, end_date):
        self.driver = webdriver.PhantomJS()
        self.base_url = 'https://www.wunderground.com/history/daily/ir/mashhad/OIMM/date/'
        self.st_date = st_date
        self.end_date = end_date

    def open_link(self, link):
        self.driver.get(link)
        html = self.driver.page_source
        return html

# Alternatively using the requests library

class WebCrawler():
    def __init__(self, st_date, end_date):
        self.base_url = 'https://www.wunderground.com/history/daily/ir/mashhad/OIMM/date/'
        self.st_date = st_date
        self.end_date = end_date

    def open_link(self, link):
        response = requests.get(link)
        html = response.text
        return html

Side note: For class names, you should use the CapWords (CamelCase) convention instead of lowercase. This is just a suggestion, but the original creator of Python co-authored PEP 8, which defines a general style guide for writing Python code. Check it out here: Class Naming

Another odd thing I found was that you are casting a string to... a string. You do this at url = str(self.base_url). This doesn't hurt anything, but it also doesn't help. I can't find any resources/links, but I suspect the redundant call costs the interpreter a little extra time. Since speed is a concern, I recommend just using url = self.base_url, since the base url is already a string.
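
If you want to check that suspicion yourself, here is a quick sketch with the standard timeit module (the absolute numbers will vary by machine):

import timeit

base_url = 'https://www.wunderground.com/history/daily/ir/mashhad/OIMM/date/'

# str() on something that is already a str returns the same object,
# but the call itself still costs a function invocation per loop.
print(timeit.timeit(lambda: str(base_url)))  # with the redundant call
print(timeit.timeit(lambda: base_url))       # plain name lookup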

I see that you are formatting and creating urls by hand, but if you want a bit more control and fewer bugs, check out the furl library.

def create_link(self, attachment):
    f = furl(self.base_url)

    # The '/=' operator means append to the end, docs: https://github.com/gruns/furl/blob/master/API.md#path
    f.path /= attachment

    # Cleanup and remove invalid characters in the url
    f.path.normalize()

    return f.url  # returns the url as a string

Another potential issue is that the extract_table() method does not actually extract anything; it simply formats the HTML in a human-readable way. I won't go into depth on this, but I recommend learning CSS selectors or XPath selectors for easily pulling data from HTML.
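
As a minimal sketch of what extract_table() could look like with a CSS selector (this assumes the 'tablesaw-sortable' table is actually present in the HTML you fetched; returning the rows as lists of cell text is just one reasonable choice):

from bs4 import BeautifulSoup

def extract_table(self, html):
    soup = BeautifulSoup(html, 'html.parser')

    # Select only the table you care about, by its class name.
    table = soup.select_one('table.tablesaw-sortable')
    if table is None:
        return []  # the table was not in the fetched HTML

    # Collect each row as a list of stripped cell texts.
    rows = []
    for tr in table.select('tr'):
        cells = [cell.get_text(strip=True) for cell in tr.select('th, td')]
        if cells:
            rows.append(cells)
    return rows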

In the date_list() method, you are using the date1 variable, which only works by accident because date1 happens to exist as a global; you almost certainly meant self.st_date. I would also break up the list comprehension and expand it over a few lines, so you can easily read and understand what it is trying to do.

Below is the full, refactored, suggested code.

from datetime import timedelta, date
from bs4 import BeautifulSoup
import requests
from furl import furl

class WebCrawler():
    def __init__(self, st_date, end_date):
        self.base_url = 'https://www.wunderground.com/history/daily/ir/mashhad/OIMM/date/'
        self.st_date = st_date
        self.end_date = end_date

    def date_list(self):
        dates = []
        total_days = int((self.end_date - self.st_date).days + 1)

        for i in range(total_days):
            # 'day' avoids shadowing the imported date class
            day = self.st_date + timedelta(days=i)
            dates.append(day.strftime('%Y-%m-%d'))

        return dates

    def create_link(self, attachment):
        f = furl(self.base_url)

        # The '/=' operator means append to the end, docs: https://github.com/gruns/furl/blob/master/API.md#path
        f.path /= attachment

        # Cleanup and remove invalid characters in the url
        f.path.normalize()

        return f.url  # returns the url as a string

    def open_link(self, link):
        response = requests.get(link)
        html = response.text
        return html

    def extract_table(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        print(soup.prettify())

    def output_to_csv(self):
        pass

date1 = date(2018, 3, 1)
date2 = date(2019, 3, 5)

test = WebCrawler(st_date=date1, end_date=date2)
date_list = test.date_list()
link = test.create_link(date_list[0])
html = test.open_link(link)
test.extract_table(html)
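
The output_to_csv() method is also still a stub. A minimal sketch using the standard csv module, assuming extract_table() is reworked to return rows as lists of cell text (as in the selector sketch above) and that the caller picks a filename per date:

import csv

def output_to_csv(self, rows, filename):
    # 'rows' is assumed to be a list of lists of cell text.
    with open(filename, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerows(rows)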

mildmelon
  • Thanks for the tips. They are really useful for me, but the problem with the `requests` library is that the page at that url holds a table with the weather information for a city, and this library does not wait for the table to be loaded; it returns the whole web page source except the table with class name `tablesaw-sortable`. – Masoud Masoumi Moghadam Apr 04 '19 at 13:44
  • You had the right idea using selenium for loading all external resources, including the table data. You will have to dig around the dev console on that site to find what link or JS library is being called and call it manually with requests. – mildmelon Apr 04 '19 at 21:22
  • Check this one out. I'm sure I can use your tips: https://stackoverflow.com/questions/55525113/how-can-i-make-the-phantomjs-webdriver-to-wait-until-a-specific-html-element-bei – Masoud Masoumi Moghadam Apr 04 '19 at 21:26
  • Great, I'll move over there to help. Can you at least give me an upvote on my answer? It looks like you used my code but haven't given me anything in return. – mildmelon Apr 04 '19 at 21:34