
I'm a student and new to Python. I would like to download PDF files (financial reports from different organizations) from a website, but I have to go through some steps first. Here's the website I'm dealing with: http://sprawozdaniaopp.mpips.gov.pl/ There are many organizations there, so I thought it would be good to download the PDFs with a script. First, my script clicks the Search button (without any criteria, to find everything), and as a result the whole list of links loads. When I click a link, a smaller window appears on the same page (this window refers only to the organization I clicked). And here's the problem: my script can't switch to that window. I searched the internet and found driver.switch_to.window and driver.switch_to.frame, but they didn't work, or I didn't use them correctly. I suspect that this is not a frame but a ui-dialog(?). When I right-clicked this window and inspected it, I found something like this:

<div class="ui-dialog ui-widget ui-widget-content ui-corner-all" tabindex="-1" role="dialog" aria-labelledby="ui-dialog-title-2" style="display: block; z-index: 1002; outline: 0px; height: auto; width: 600px; top: 234.5px; left: 328px;"><div class="ui-dialog-titlebar ui-widget-header ui-corner-all ui-helper-clearfix"><span class="ui-dialog-title" id="ui-dialog-title-2">Szczegółowe informacje o organizacji</span><a href="#" class="ui-dialog-titlebar-close ui-corner-all" role="button"><span class="ui-icon ui-icon-closethick">close</span></a></div><div style="width: auto; min-height: 0px; height: 401.896px;" class="ui-dialog-content ui-widget-content" scrolltop="0" scrollleft="0"> (...)

I don't know how to tell my script to switch to this kind of dialog window (?) so that it can search for the "Sprawozdanie merytoryczne" link for 2016 only.

A strange thing about this site: when I check a link, for example http://sprawozdaniaopp.mpips.gov.pl/Search/Details/0000000168, it can only be opened with a left click. When I try to open it in a new tab, it fails (why?) with the following error: "Server Error in '/' Application. The resource cannot be found. Description: HTTP 404. The resource you are looking for (or one of its dependencies) could have been removed, had its name changed, or is temporarily unavailable. Please review the following URL and make sure that it is spelled correctly."

Here is my script in Python:

import urllib
import urllib.request
import requests
import re

url = "http://sprawozdaniaopp.mpips.gov.pl/Search/Print/13313?reporttypeId=13"


r = requests.get(url)
#with open(r'C:\Users\username\Desktop\financialreport1.pdf', 'wb') as f:
#       f.write(r.content)

from selenium import webdriver

chrome_path= r"C:\Users\username\AppData\Local\Programs\Python\Python35-32\Scripts\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("http://sprawozdaniaopp.mpips.gov.pl/")

#Button Search called here in polish "Znajdź"
elem = driver.find_element_by_xpath("//*[@id='btnsearch']/span") 
elem.click()

#testing if I'm able to find links on this website 
#elems = driver.find_elements_by_xpath("//a[@href]")
#for elem in elems:
    #print (elem.get_attribute("href"))

#Clicking on the first link (in the future I want to do this in a loop for every link)
#elem1 = driver.find_element_by_xpath("//*[@id='form1']/div/div[4]/table/tbody/tr[1]/td[3]/a")
elem1 = driver.find_element_by_css_selector("#form1 > div > div.grid > table > tbody > tr:nth-child(1) > td:nth-child(3) > a")
elem1.click()

#doesn't work
#driver.switch_to.window("#form1 > div > div.grid > table > tbody > tr:nth-child(1) > td:nth-child(3) > a")

#below doesn't work because I can't switch to window where elem2 is placed
elem2 = driver.find_element_by_css_selector("body > div.ui-dialog.ui-widget.ui-widget-content.ui-corner-all > div.ui-dialog-content.ui-widget-content > table:nth-child(4) > tbody > tr:nth-child(7) > td:nth-child(1) > a")
elem2.click()

I've attached some screenshots to illustrate my problem. I would be very grateful for any piece of advice or some keywords that I should look for (maybe the answer is obvious and I just don't understand it as a newbie). Greetings!

[Screenshots: partial list of organizations; wanted PDF document, which opens in a new tab after clicking the yellow link]

ou_ryperd
ejdi
  • When I click on Search button, then click on any link on this page, I got an error : `Server Error in '/' Application. The resource cannot be found. Description: HTTP 404.`. Seems that this page is broken. – krokodilko Nov 19 '17 at 15:29
  • I checked it now and you are right - now it's broken. Yesterday it was broken for some time, too. But in times when it is working, this error doesn't happen if I click on link ( unless I try to open it in new window/tab). I don't understand it :( – ejdi Nov 19 '17 at 15:50
  • @krokodilko Once you click search button - it's good to wait for a moment ( to load full page). After that you can click on any link - and the window (as in attached picture) should open :) – ejdi Nov 19 '17 at 15:56
  • You do not have to switch - this dialog is rendered inside the HTML page, so it is on the same page. It was just hidden, and when you click on the link, it appears. Which PDF do you need to download? There are a few files, like `bilans, sprawozdanie merytoryczne ...`. – krokodilko Nov 19 '17 at 16:05
  • @krokodilko I want to download "Sprawozdanie merytoryczne" only for 2016 year. And then - similarly for each organization. Only "Sprawozdanie merytoryczne" for 2016 year. When I tried without switching python showed an error that it couldn't find an element :( – ejdi Nov 19 '17 at 17:15
  • A few more questions - it seems that some records do not have a file and show the message `Organizacja nie została jeszcze zarejestrowana w systemie` instead; what do you want to do in such a case? Files are saved with default names like `123.pdf`, which is some kind of internal ID; is that OK, or do you want to change the filename? The HTTP 404 error appears when the page is not fully loaded after the Search button is clicked; you must wait until all 1800 records appear on the page, and then the HTTP 404 error disappears. – krokodilko Nov 19 '17 at 19:22
  • @krokodilko If there's a message "Organizacja nie została jeszcze zarejestrowana w systemie" I would like to leave it and go to another organization. I want to change filename to name of organization. As final effect I would like to have files with names of organizations ( with financial information from 2016) in folder on my PC for example on my desktop ;) – ejdi Nov 19 '17 at 19:33
  • Can you sum up the `Manual Steps` you are trying to `Automate`? – undetected Selenium Nov 20 '17 at 04:46
  • Ok and what will you like to do once the `Nazwa` loads (list of organizations loads)? – undetected Selenium Nov 20 '17 at 11:15
  • @DebanjanB I want my script in Python to do steps below: 1. Go to website http://sprawozdaniaopp.mpips.gov.pl/ 2. Click on "Znajdź button" ( it means search) 3. Wait until list of organizations loads. 4. Click on first organization -> Dialog window is rendered. 5. Download "Sprawozdanie merytoryczne" only for 2016 year and save this file under name of organization. (If there's no such file skip and continue for next organization. ) 6. Close rendered window. 7. Click on next organization and go to 5th point. 8. Repeat until the end of the list. 9. Close browser – ejdi Nov 20 '17 at 11:15
  • @DebanjanB Firstly I wanted to test if I'm able to download file for first organization ( I was testing how far my script can go) and I got stuck after clicking on link of first organization. I didn't know how to access "Sprawozdanie merytoryczne". – ejdi Nov 20 '17 at 11:40
  • What do you want to do after searching out ` "Sprawozdanie merytoryczne"`? – undetected Selenium Nov 20 '17 at 13:28
  • @DebanjanB I want to download it :) – ejdi Nov 20 '17 at 13:38
  • Okay I guess you want to click the link :) – undetected Selenium Nov 20 '17 at 13:44
  • @DebanjanB Yes. I don't know if I have a good approach to whole case. Maybe it's better idea for downloading it. I thought that the script should click as I would click if I wanted to download documents manually. – ejdi Nov 20 '17 at 13:49
  • @ejdi Check out my Answer – undetected Selenium Nov 20 '17 at 13:53
  • @DebanjanB Your answer works!!! Thank you! I wouldn't have figured it out on my own. It required these imports: `from selenium.webdriver.common.by import By`, `from selenium.webdriver.support.ui import WebDriverWait`, `from selenium.webdriver.support import expected_conditions as EC`. Now I will think about how to proceed with the next steps :) – ejdi Nov 20 '17 at 14:00
  • Great news !!! Glad to be able to help you. – undetected Selenium Nov 20 '17 at 14:03
  • @DebanjanB I edited my question (the new part starts from "New problem: ..."). I can't find a solution :(. I thought that it might be easy for you. I would be really grateful if you could look at it. – ejdi Aug 22 '18 at 09:36
  • @ejdi Please don't change/edit the question, since it already has an accepted answer. If you change the question now, it won't be useful/helpful for future readers. Instead, ask a new question for your new requirement. Stack Overflow volunteers will be happy to help you out. This time I am reverting the question to its previous state. – undetected Selenium Aug 22 '18 at 09:43
  • @DebjanianB I'm very sorry for that. I thought that it is related so I did it that way. Thank you for the explanation. Of course you are right. – ejdi Aug 22 '18 at 09:46
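Step 5 of the plan listed in the comments above (save the report under the organization's name) can be sketched with the standard library alone. This is only a rough sketch under assumptions: `safe_filename` and `save_report` are hypothetical helper names, the report URL would have to be read from the link inside the dialog (for example with `get_attribute("href")`), and it has not been confirmed that this site serves the PDF to a plain HTTP request outside the browser session.

```python
import re
import urllib.request
from pathlib import Path


def safe_filename(org_name, extension=".pdf"):
    """Turn an organization name into a file name that Windows accepts."""
    # Replace the characters Windows forbids in file names.
    cleaned = re.sub(r'[<>:"/\\|?*]', "_", org_name)
    # Collapse whitespace runs, then trim surrounding spaces and dots.
    cleaned = re.sub(r"\s+", " ", cleaned).strip(" .")
    return (cleaned or "unnamed") + extension


def save_report(pdf_url, org_name, folder):
    """Download pdf_url into folder, naming the file after the organization.

    Hypothetical helper: assumes pdf_url can be fetched directly,
    which may not hold if the site requires the browser session.
    """
    target = Path(folder) / safe_filename(org_name)
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(pdf_url, timeout=30) as response:
        target.write_bytes(response.read())
    return target
```

A call would look like `save_report(url, "Name of organization", r"C:\Users\username\Desktop\reports")`, where the folder is just an example path. If a direct request turns out not to work, the same `safe_filename` helper can still be used to rename the file the browser downloaded.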

1 Answer


On the website http://sprawozdaniaopp.mpips.gov.pl/, after clicking the Search button and then the first link, we need to wait for the modal box to open and then click the Sprawozdanie merytoryczne link. Here is your own code with a simple tweak:

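from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Click the first organization, wait up to 10 s for the dialog, then click the report link
elem1 = driver.find_element_by_css_selector("#form1 > div > div.grid > table > tbody > tr:nth-child(1) > td:nth-child(3) > a")
elem1.click()
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR,".ui-dialog.ui-widget.ui-widget-content.ui-corner-all")))
driver.find_element_by_link_text("Sprawozdanie merytoryczne").click()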
undetected Selenium