There are quite a few notable things wrong with this code and how you are using the libraries. Let me try to fix it up.
First, I don't see you using the urllib.request library anywhere, so you can remove that import. If you do use it elsewhere in your code, I recommend the widely used requests module instead. I also recommend using requests instead of selenium if you are only trying to get the HTML source of a site, as selenium is designed more for navigating sites and acting as a 'real' person.
You can use response = requests.get('https://your.url.here') and then response.text to get the returned HTML.
Next, I noticed that in the open_link() method you are creating a new instance of the PhantomJS class each time you call the method. This is really inefficient, as selenium uses a lot of resources (and takes a long time to start, depending on the driver you are using), and it may be a big contributor to your code running slower than desired. You should reuse the driver instance as much as possible; selenium was designed to be used that way. A great solution is to create the driver instance once, in the webcrawler.__init__() method.
    class WebCrawler():
        def __init__(self, st_date, end_date):
            self.driver = webdriver.PhantomJS()
            self.base_url = 'https://www.wunderground.com/history/daily/ir/mashhad/OIMM/date/'
            self.st_date = st_date
            self.end_date = end_date

        def open_link(self, link):
            self.driver.get(link)
            html = self.driver.page_source
            return html
    # Alternatively, using the requests library
    class WebCrawler():
        def __init__(self, st_date, end_date):
            self.base_url = 'https://www.wunderground.com/history/daily/ir/mashhad/OIMM/date/'
            self.st_date = st_date
            self.end_date = end_date

        def open_link(self, link):
            response = requests.get(link)
            html = response.text
            return html
Side note: for class names, you should use CamelCase instead of lowercase. This is just a suggestion, but PEP 8, co-authored by the original creator of Python, defines a general style guide for writing Python code. Check out its section on class naming.
Another odd thing I found was that you are casting a string to... a string, at url = str(self.base_url). This doesn't hurt anything, but it doesn't help either. I can't find any resources/links, but I have a suspicion that this takes extra time for the interpreter. Since speed is a concern, I recommend just using url = self.base_url, since the base url is already a string.
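A quick standalone check (my own snippet, not from the original code) shows that calling str() on something that is already a string just hands back the same object:

```python
base_url = 'https://www.wunderground.com/history/daily/ir/mashhad/OIMM/date/'

# In CPython, str() applied to an object that is already exactly a str
# returns the very same object unchanged -- the call is pure overhead.
print(str(base_url) is base_url)  # True
```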
I see that you are formatting and creating URLs by hand, but if you want a bit more control and fewer bugs, check out the furl library.
    def create_link(self, attachment):
        f = furl(self.base_url)
        # The '/=' operator appends to the end of the path, docs: https://github.com/gruns/furl/blob/master/API.md#path
        f.path /= attachment
        # Clean up and remove invalid characters in the url
        f.path.normalize()
        return f.url  # Return the url as a string
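If you would rather stay within the standard library, urllib.parse.urljoin (my suggestion here, not part of furl) can handle this kind of joining too, as long as the base URL ends with a slash:

```python
from urllib.parse import urljoin

base_url = 'https://www.wunderground.com/history/daily/ir/mashhad/OIMM/date/'

# Because the base ends with '/', urljoin appends the new segment
# rather than replacing the last path component.
link = urljoin(base_url, '2018-3-1')
print(link)
# https://www.wunderground.com/history/daily/ir/mashhad/OIMM/date/2018-3-1
```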
Another potential issue is that the extract_table() method does not actually extract anything; it simply formats the HTML in a human-readable way. I won't go into depth on this, but I recommend learning CSS selectors or XPath selectors for easily pulling data from HTML.
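As a rough sketch of what CSS-selector-based extraction could look like with BeautifulSoup (the table markup below is made up for illustration, not the real wunderground page):

```python
from bs4 import BeautifulSoup

# Toy HTML standing in for the real page -- purely illustrative markup.
html = """
<table id="history">
  <tr><td class="temp">12</td><td class="wind">7</td></tr>
  <tr><td class="temp">15</td><td class="wind">4</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
# The CSS selector '#history td.temp' matches every <td> with class
# "temp" inside the element whose id is "history".
temps = [cell.get_text() for cell in soup.select('#history td.temp')]
print(temps)  # ['12', '15']
```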
In the date_list() method, you are trying to use the date1 variable, but it is not defined anywhere in that scope. I would break up the lambda in there and expand it over a few lines, so you can easily read and understand what it is trying to do.
Below is the full, refactored, suggested code.
    from datetime import timedelta, date

    import requests
    from bs4 import BeautifulSoup
    from furl import furl


    class WebCrawler():
        def __init__(self, st_date, end_date):
            self.base_url = 'https://www.wunderground.com/history/daily/ir/mashhad/OIMM/date/'
            self.st_date = st_date
            self.end_date = end_date

        def date_list(self):
            dates = []
            total_days = int((self.end_date - self.st_date).days + 1)
            for i in range(total_days):
                # Use a distinct name so we don't shadow the imported date class
                current_date = self.st_date + timedelta(days=i)
                dates.append(current_date.strftime('%Y-%m-%d'))
            return dates

        def create_link(self, attachment):
            f = furl(self.base_url)
            # The '/=' operator appends to the end of the path, docs: https://github.com/gruns/furl/blob/master/API.md#path
            f.path /= attachment
            # Clean up and remove invalid characters in the url
            f.path.normalize()
            return f.url  # Return the url as a string

        def open_link(self, link):
            response = requests.get(link)
            html = response.text
            return html

        def extract_table(self, html):
            soup = BeautifulSoup(html, 'html.parser')
            print(soup.prettify())

        def output_to_csv(self):
            pass


    date1 = date(2018, 3, 1)
    date2 = date(2019, 3, 5)

    test = WebCrawler(st_date=date1, end_date=date2)
    date_list = test.date_list()
    link = test.create_link(date_list[0])
    html = test.open_link(link)
    test.extract_table(html)