
I would like some advice on how to scrape data from this website.

I started with Selenium but got stuck right away: for example, I have no idea how to set the dates.

My code so far:

from bs4 import BeautifulSoup as soup
from openpyxl import load_workbook
from openpyxl.styles import PatternFill, Font
from selenium import webdriver
from selenium.webdriver.common.by import By
import datetime
import os
import time
import re

day = datetime.date.today().day
month = datetime.date.today().month
year = datetime.date.today().year
my_url = 'https://www.eex-transparency.com/homepage/power/germany/production/availability/non-usability-by-unit/non-usability-history'
cookieValue = '12-c12-cached|from:' +str(year)+ '-' +str(month)+ '-' +str(day-5)+ ','+'to:' +str(year)+ '-' +str(month)+ '-' + str(day) +',dateType:1,company:PreussenElektra,fuel:uranium,canceled:0,durationComparator:ge,durationValue:5,durationUnit:day'
#saving url
browser = webdriver.Chrome(executable_path=r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe")
my_url = 'https://www.eex-transparency.com/homepage/power/germany/production/availability/non-usability-by-unit'
browser.add_cookie({'name': 'tem', 'value': cookieValue})
browser.get(my_url)
my_url = 'https://www.eex-transparency.com/homepage/power/germany/production/availability/non-usability-by-unit/non-usability-history'
browser.get(my_url)

Obviously I am not asking for code, just some suggestions on how to continue with Selenium (how to set the dates and the other fields), or any other idea on how to scrape this website.

Thanks in advance.

EDIT: I am trying the cookie approach. The code above is my updated version. I read that the cookie needs to be created before loading the page, so that is what I did. Any idea why it is not working?

G. Bartowski
  • The id `affected_date` represents two elements: 1. Period reported, 2. Period concerned. Which one do you want to select? – cruisepandey Jul 11 '18 at 13:47
  • To set the dates, you need to iterate through the div and the table inside it. That's it. I am not sure what you're asking here. Please be more specific. Thanks! – demouser123 Jul 11 '18 at 13:51
  • I have no idea how to interact with the calendar. I tried to set the date as text but it does not work. – G. Bartowski Jul 11 '18 at 13:58
  • @DavideRavera This sounds like an [X-Y problem](http://xyproblem.info/). Instead of asking for help with your solution to the problem, edit your question and ask about the actual problem. What are you trying to do? – undetected Selenium Jul 11 '18 at 14:01
  • I am trying to modify the start date and end date in the calendar, but I thought to ask more generally because maybe there is a more straightforward method. For example I started with selenium in another situation and I ended up using json which was actually much faster. – G. Bartowski Jul 11 '18 at 14:05

2 Answers


Is there any particular reason why you have decided to use selenium over other web scraping tools (scrapy, urllib, etc.)? I personally have not used Selenium but I have used some of the other tools. Below is an example of a script to just pull all the html from a page.

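# Python 3: urllib2 was merged into urllib.request
from urllib.request import urlopen
from bs4 import BeautifulSoup as soup

link = "https://ubuntu.com"
page = urlopen(link)              # fetch the raw HTML
data = soup(page, 'html.parser')  # parse it into a navigable tree

print(data)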

This is just a short script to pull all the HTML off a page. I believe BeautifulSoup has additional tools for inputting data into fields, but the exact method slips my mind right now. If I can find my notes on it, I will edit this post. I remember it being very straightforward, though.

Best of luck!

Edit: here's a discussion of web scraping tools from reddit a while back that I had saved: https://www.reddit.com/r/Python/comments/1qnbq3/webscraping_selenium_vs_conventional_tools/

  • I have already used urllib but if I am not wrong I can't interact with the input forms if they are not included in the page link and I can only pull out the html and look into it – G. Bartowski Jul 11 '18 at 14:07
  • This prior post has some good information on how to input into the forms [https://stackoverflow.com/questions/13166395/fill-input-of-type-text-and-press-submit-using-python]. I looked at the page and it doesn't seem to be anything that can't be handled by "inspect element" – Zachary Brasseaux Jul 11 '18 at 14:14

The best approach for you is to change the cookies, because all the filter data is stored in a cookie.

Check the cookies in Chrome (F12 -> Application -> Cookies) and play with the filters. If you change a cookie in the developer tools, you have to refresh the page :)

Check this post on how to change cookies in selenium python.

To get values from the website you have to use the classic way, like you did here, but you will have to use classes:

# 'aaaaaa' is a placeholder class name; By comes from selenium.webdriver.common.by
radio = browser.find_elements(By.CLASS_NAME, 'aaaaaa')

You can always use XPath to search for elements (Chrome will generate the expressions for you).
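As a rough sketch of the cookie approach, assuming the cookie is named `tem` as in the question and that its value follows the `from:...,to:...` layout shown there (with the `12-cached|` prefix reported in the comments): Selenium can only set a cookie for the domain that is currently loaded, so the site has to be opened once before `add_cookie` is called. The function names `build_filter_cookie` and `open_with_filter` are made up for this example, and the unpadded date format is copied from the question rather than verified against the site.

```python
import datetime


def build_filter_cookie(today, days_back=5):
    """Build the filter cookie value covering the last `days_back` days."""
    # timedelta handles month and year boundaries, unlike `day - 5`,
    # which produces an invalid date during the first days of a month.
    start = today - datetime.timedelta(days=days_back)

    def fmt(d):
        # Unpadded year-month-day, mirroring the question's cookie string
        # (assumption: the site accepts this format).
        return f"{d.year}-{d.month}-{d.day}"

    return (
        f"12-cached|from:{fmt(start)},to:{fmt(today)},"
        "dateType:1,company:PreussenElektra,fuel:uranium,canceled:0,"
        "durationComparator:ge,durationValue:5,durationUnit:day"
    )


def open_with_filter(browser, base_url, history_url, cookie_value):
    """Apply the cookie on an already-created Selenium WebDriver."""
    # Visit the site first so the browser is on the cookie's domain.
    browser.get(base_url)
    browser.add_cookie({"name": "tem", "value": cookie_value})
    # Load the target page so the site reads the new cookie.
    browser.get(history_url)
    return browser.page_source
```

With the question's setup this would be called as `open_with_filter(browser, first_url, history_url, build_filter_cookie(datetime.date.today()))`, where the two URLs are the ones already used in the question.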

Maciej Pulikowski
  • Cool I will look into it – G. Bartowski Jul 11 '18 at 14:12
  • Why should I look for values? I will just open the page with the cookie and then download the html and take out the table? Right? No? – G. Bartowski Jul 11 '18 at 14:49
  • I edited the question with what I did till now with cookies. Can you give it a look? – G. Bartowski Jul 11 '18 at 16:56
  • Yes, you can download the entire HTML and then extract the values with a regex. My cookie starts with `12-cached|from:`, without the `c12-`. Also try waiting 60 seconds before you download the HTML, because this website loads its table in a strange way. – Maciej Pulikowski Jul 12 '18 at 07:44
  • But what I see is that when it finds actual data it is faster (for example, the same cookie but with the period 2018/4/2 - 2018/7/11). Should I use a sleep time of 60 seconds anyway, or might something like 10 seconds be fine? – G. Bartowski Jul 12 '18 at 09:35
  • You need to experiment with time, try different filters to fit your needs. But remember you will have higher success rate with longer time. Do you need more help? If not, close thread :) – Maciej Pulikowski Jul 12 '18 at 11:59
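
On the question of 10 versus 60 seconds raised in the comments, one possible alternative to a fixed sleep is to poll until the table has rows, giving up after an upper limit. The sketch below is a generic helper rather than part of Selenium; the name `wait_until` and the `table tr` selector in the usage comment are assumptions and would need to be adapted to the page's actual markup.

```python
import time


def wait_until(predicate, timeout=60.0, poll=1.0):
    """Call predicate() every `poll` seconds until it returns something truthy.

    Returns the truthy result, or raises TimeoutError once `timeout`
    seconds have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Hypothetical usage with a Selenium driver named `browser`
# (requires: from selenium.webdriver.common.by import By):
# rows = wait_until(lambda: browser.find_elements(By.CSS_SELECTOR, "table tr"))
# html = browser.page_source
```

This returns as soon as data appears, so fast queries are not held up, while slow ones still get the full 60 seconds before the script gives up.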