I am trying to scrape Google search results with BeautifulSoup in Google Colab. My code returns relevant data, but it seems to ignore the start date/end date element of the URL and just brings up the newest 100 headlines. I have had issues setting up Selenium in Colab, so I was wondering whether there is an alternative way of searching only within a specific date range, other than modifying the URL, or whether there is another fix. Any advice would be appreciated. Thanks.

import csv

import requests
from bs4 import BeautifulSoup


class Scrape:

    def __init__(self, search_term, start_date, end_date):
        # start_date and end_date are indexed as (day, month, year)
        self.search_term = search_term
        self.start_date = start_date
        self.start_day = start_date[0]
        self.start_month = start_date[1]
        self.start_year = start_date[2]
        self.end_day = end_date[0]
        self.end_month = end_date[1]
        self.end_year = end_date[2]
        # tbs=cdr:1,cd_min:M/D/Y,cd_max:M/D/Y is the date-range filter; tbm=nws selects news results
        self.url = 'https://www.google.com/search?q={0}&biw=1053&bih=1138&source=lnt&tbs=cdr%3A1%2Ccd_min%3A{1}%2F{2}%2F{3}%2Ccd_max%3A{4}%2F{5}%2F{6}&tbm=nws&num=100'.format(self.search_term, self.start_month, self.start_day, self.start_year, self.end_month, self.end_day, self.end_year)
        self.filename = '{0}{1}.csv'.format(self.search_term, self.start_date)
        self.behaviour_index = 0

    def run(self):
        response = requests.get(self.url)
        soup = BeautifulSoup(response.text, 'html.parser')
        headlines = soup.find_all('div', {'class': "BNeawe vvjwJb AP7Wnd"})

        # The with block closes the file even if writing fails
        with open(self.filename, 'w', newline='') as csv_file:
            csv_writer = csv.writer(csv_file)
            csv_writer.writerow(['text', 'sentiment'])

            # One row per headline, with a placeholder sentiment of 0
            for headline in headlines:
                csv_writer.writerow([headline.get_text(), 0])
  • Did you try these tips for setting up Selenium (https://stackoverflow.com/questions/51046454/how-can-we-use-selenium-webdriver-in-colab-research-google-com) (https://medium.com/@darektidwell1980/using-selenium-with-google-colaboratory-ca4a4f21021f)? – rchurt Jul 02 '20 at 01:12
  • Yes, it doesn't seem to be working for me – bebop Jul 02 '20 at 01:26
  • What did you try and what error did you get? It might be easier to fix that than to come up with a workaround. But if you want to change the URL, I would use string interpolation (https://stackabuse.com/python-string-interpolation-with-the-percent-operator/) – rchurt Jul 02 '20 at 01:30
  • For selenium, I changed my run method to `def run(self): # open it, go to a website, and get results driver = webdriver.Chrome('chromedriver',chrome_options=chrome_options) driver.get(self.url) responses = driver.find_elements_by_class_name("BNeawe vvjwJb AP7Wnd") for response in responses: print(response.text)` but this wasn't returning any results – bebop Jul 02 '20 at 01:47
  • Sorry, I'm a bit new to this stuff so could you explain how string interpolation would be applied in this situation? Thanks – bebop Jul 02 '20 at 01:49
  • Sure, I'm not sure what the format for the URL would be, but you would first do something like `date = '2020/07/01'`, `time = '6:55'` inside a `for` loop, and then to make the URL you could do something like: `'https://www.google.com/search?q=%s_%s' % (date, time)` which would result in something like `'https://www.google.com/search?q=2020/07/01_6:55'`. And you could iterate through several in the loop. Haven't tested any part of that, just meant to illustrate the concept. – rchurt Jul 02 '20 at 01:56
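
To make the string-building idea from the comments concrete for the date range in the question, here is a minimal sketch that assembles the same query parameters with `urllib.parse.urlencode` instead of hand-written `%3A`/`%2F` escapes. The function name `build_news_url` and the example search term are hypothetical, and the parameter names (`tbs`, `cdr`, `cd_min`, `cd_max`, `tbm`, `num`) are simply copied from the URL in the question. Google does not document them, so this only shows how to build the URL reliably, not a guarantee that the date filter will be honoured.

```python
from datetime import date
from urllib.parse import urlencode


def build_news_url(search_term, start, end, num=100):
    """Build a news search URL restricted to the range start..end.

    Assumption: the parameter names below are taken from the URL in the
    question and are not documented by Google.
    """
    # The question's URL puts the date bounds in month/day/year order.
    tbs = "cdr:1,cd_min:{0}/{1}/{2},cd_max:{3}/{4}/{5}".format(
        start.month, start.day, start.year, end.month, end.day, end.year
    )
    params = {"q": search_term, "tbs": tbs, "tbm": "nws", "num": num}
    # urlencode percent-escapes ':', ',' and '/' and turns spaces into '+'.
    return "https://www.google.com/search?" + urlencode(params)


url = build_news_url("electric cars", date(2020, 6, 1), date(2020, 6, 30))
print(url)
```

Passing `datetime.date` objects rather than separate day, month and year values keeps the ordering in one place, and escaping the search term this way means a multi-word query no longer produces a malformed URL.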

0 Answers