
I'm trying to scrape the time of the next upcoming game from the ESPN homepage: https://www.espn.com/ (right now it appears to be a soccer match between Juventus and AC Milan).

I have the following Python code for my web scrape:

import requests
from lxml import html
from selenium import webdriver
import chromedriver_binary

driver = webdriver.Chrome()
driver.get('https://www.espn.com/')

tree = html.fromstring(driver.page_source)

time = tree.xpath('//*[@id="news-feed"]/section[1]/header/a/div[2]/span[2]/span')

print(time)

but it returns this error:

Traceback (most recent call last):
  File "c:\Users\akash\Coding\test\scrape.py", line 9, in <module>
    tree = html.fromstring(driver.page_source)
  File "C:\Users\akash\AppData\Local\Programs\Python\Python38-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 679, in page_source
    return self.execute(Command.GET_PAGE_SOURCE)['value']
  File "C:\Users\akash\AppData\Local\Programs\Python\Python38-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "C:\Users\akash\AppData\Local\Programs\Python\Python38-32\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchWindowException: Message: no such window: target window already closed
from unknown error: web view not found
  (Session info: chrome=83.0.4103.97)

I suspect the problem is that this is dynamic content on the ESPN website, since I was able to scrape data from another website with static content using the same code (only changing the URL and XPath). Could anyone help fix this error?

I've already installed each of the Python libraries used in the code. (Note: I've already looked at "Scraping using python and xpath" and "Python Selenium Chrome Webdriver".)

  • What exactly are you trying to scrape from the site? Since I'm looking at this 2 days later, obviously the content has changed, so I don't know exactly what you want to pull. – chitown88 Jun 15 '20 at 07:19
  • Since the content has changed, I've been trying to scrape from a different website. Check the comments on the answer below to see the discussion I had with Dmitry. – Computer Crafter Jun 16 '20 at 13:38

1 Answer


In my case, I used a chromedriver binary downloaded from chromium.org. The code is as follows:

from lxml import html
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")

# Point Selenium at the chromedriver binary sitting next to this script.
driver = webdriver.Chrome(r'./chromedriver', chrome_options=chrome_options)
driver.get('https://www.espn.com/')

# Parse the rendered page and select the game time by its class
# instead of a brittle positional XPath.
tree = html.fromstring(driver.page_source)
time = tree.xpath("//*[@id='news-feed']//span[@class='game-time']/text()")[0].strip()
print(time)

The --headless argument passed to chrome_options is optional; it just runs Chrome in headless mode, i.e. without opening a visible browser window.
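
Because the scoreboard is rendered after the initial page load, it can also help to wait explicitly for the element before reading page_source. Below is a minimal sketch building on the code above; it is not part of the original answer, and the news-feed XPath and game-time class are the ones used above, which may no longer match ESPN's current markup.

from lxml import html
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

chrome_options = Options()
chrome_options.add_argument("--headless")

driver = webdriver.Chrome(r'./chromedriver', chrome_options=chrome_options)
driver.get('https://www.espn.com/')

# Wait up to 15 seconds for the game-time span to appear in the DOM
# before handing the page source to lxml.
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located(
        (By.XPATH, "//*[@id='news-feed']//span[@class='game-time']")
    )
)

tree = html.fromstring(driver.page_source)
time = tree.xpath("//*[@id='news-feed']//span[@class='game-time']/text()")[0].strip()
print(time)

driver.quit()

If no game is scheduled at the moment, the wait times out with a TimeoutException instead of silently producing an empty list, which makes the failure easier to diagnose.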

Dmitry
  • Hi, thanks for answering. Unfortunately, I still had an issue: [link](https://drive.google.com/file/d/1EVcbrFlA15oLM4Tkpj8aUcRT_jMyTCUl/view?usp=sharing) (sorry for the weird crossout with the "test" thing) and I used the exact same code as you. My `chromedriver.exe` is inside the `chromedriver` folder which is directly inside the project folder. – Computer Crafter Jun 12 '20 at 21:21
  • Did you write the ".exe" extension of the file? Like this: driver = webdriver.Chrome(r'./chromedriver.exe', chrome_options=chrome_options) – Dmitry Jun 12 '20 at 21:28
  • Ok, so upon seeing your comment I changed it to `driver = webdriver.Chrome(r'./chromedriver/chromedriver.exe', options=chrome_options)`, but now only `[]` is output (I changed to a site other than ESPN because the upcoming game isn't there anymore). Is this another problem with dynamic websites? I also tried scraping from another website and it worked fine. – Computer Crafter Jun 12 '20 at 22:08
  • There is no problem with dynamic websites and Selenium. You got an empty list because there is no event on ESPN anymore. The same problem on another site, as you say, may be the result of an incorrect XPath route. – Dmitry Jun 13 '20 at 07:49
  • Interesting. On that same site, I tried scraping another element, and that was able to work, but when I tried scraping the element that I need, it still returns `[]`. I have no idea why this is happening. The site I'm trying to scrape from is confidential, but for reference, the element that scraped successfully was an `h4`, while the element that isn't scraping is a `td`. I've tried both the short and full xpath, and also tried adding `/text()` to the end of either xpath, to no success. – Computer Crafter Jun 13 '20 at 16:08
  • Could you provide the raw HTML and the element that you want to scrape? – Dmitry Jun 13 '20 at 17:47
  • Sure: the `td` containing `64`, with XPath `//*[@id="recent-point-data"]/tbody/tr/td[1]`, is the one that wouldn't scrape, while the `h4` containing `Points`, with XPath `/html/body/div[7]/main/div[3]/div/div[4]/div[2]/div/div[3]/div/section/h4`, scraped. – Computer Crafter Jun 13 '20 at 18:55
  • Please provide the complete raw HTML of the page. It's hard to tell what's wrong from only the XPath that you showed. – Dmitry Jun 13 '20 at 19:05
  • The page HTML itself is really long (~1600 lines), and sharing it would break confidentiality. However, trying this next example, it seems that every element I try to scrape outputs `[]`. I know some of these elements are displayed based on the user, so I also tried scraping the "Get Coins" at the very top and the "Keep Yourself Safe and Informed" description, still returning `[]`. – Computer Crafter Jun 13 '20 at 19:31
  • Hi @Dmitry, have you figured out anything about this Reddit issue yet? – Computer Crafter Jun 17 '20 at 18:58
  • Sorry, but I have no Reddit account. Can we try it on another site? By the way, if you run the code on ESPN now it will give you the expected results, because a soccer event is going on right now. – Dmitry Jun 17 '20 at 20:18
  • Ok sure, I was able to find a new example. Go to [link](https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html). It's possible to scrape all the titles on that page, and the "Total Cases" data, but if you try scraping some of the table data, such as the data in the "New Cases by Day" table, it returns `[]`. I think finding a solution to this would work because it seems that the problem is with table data (not sure). – Computer Crafter Jun 18 '20 at 18:04
  • So there are all titles and text bodies of the table: `from selenium import webdriver` `from selenium.webdriver.chrome.options import Options ` `driver = webdriver.Chrome(r'./chromedriver')` `driver.get("https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html")` `titles = driver.find_elements_by_xpath("//section[@class='titleCallout']/h4")` `bodies = driver.find_elements_by_xpath("//section[@class='titleCallout']/p")` `titles_text = [h.text for h in titles if h.text]` `bodies_text = [b.text for b in bodies if b.text]` print(titles_text) print(bodies_text) – Dmitry Jun 18 '20 at 18:53
  • Sorry for my previous comment. I tried to write all code here but comments are not formatted well. So you can get all title elements like: `titles = driver.find_elements_by_xpath("//section[@class='titleCallout']/h4")` and all text bodies elements like: `bodies = driver.find_elements_by_xpath("//section[@class='titleCallout']/p")` and then get text like: `titles_text = [h.text for h in titles if h.text]` and `bodies_text = [b.text for b in bodies if b.text]` – Dmitry Jun 18 '20 at 19:01
  • That works, and it worked initially anyway with titles. But the problem is with the table data. For example, go to the "View Data" bar below the graph of the "New Cases by Day" section, and expand the data. If I try scraping the "6/18/2020" data, it still outputs `[]`. – Computer Crafter Jun 19 '20 at 17:51
  • No problems with this either. The table data you want to parse is inside an iframe, so first you should switch to the iframe: `driver.switch_to.frame("cdcCharts3")`. Then find all the headers: `table_headers = driver.find_elements_by_xpath('//*[@id="cdc-chart-1-data"]/thead/tr/th')` and finally get the header text like: `table_headers_text = [h.get_attribute('textContent') for h in table_headers]`, which gives you an array of header data. Then you can get all the table row data without switching to the iframe again, because you are already in it. But if you want to parse another table's data, then you should switch to that other frame (a consolidated sketch follows these comments). – Dmitry Jun 19 '20 at 19:03
  • Thanks! It worked on the website I needed too! Just a couple of additional questions: 1. I want to display the output from this code on a webpage, so how would I do that? 2. I also need to scrape some data from a couple of other websites and display it on the webpage - is there a way to do all this in the same Python file? – Computer Crafter Jun 19 '20 at 19:48
  • 1. There are a lot of ways to do that, but it is a topic for another discussion. 2. All sites differ in markup, so the general approach will work, but you need to deal with each one individually. Please mark my answer as helpful and upvote my comments, because they answer all of your questions. – Dmitry Jun 20 '20 at 06:28
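
For readers following the thread, here is the iframe approach from the comments gathered into one runnable sketch, using the Selenium 3 syntax used throughout the discussion. The frame name cdcCharts3 and the table id cdc-chart-1-data come from the comment above and reflect the CDC page as it looked in June 2020 (they may have changed since); the row XPath is an assumption modelled on the header XPath given there.

from selenium import webdriver

driver = webdriver.Chrome(r'./chromedriver')
driver.get("https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html")

# The "New Cases by Day" table lives inside an iframe, so switch into it first.
driver.switch_to.frame("cdcCharts3")

# Read the table headers; textContent is used because hidden cells would make
# element.text return an empty string.
table_headers = driver.find_elements_by_xpath('//*[@id="cdc-chart-1-data"]/thead/tr/th')
table_headers_text = [h.get_attribute('textContent') for h in table_headers]
print(table_headers_text)

# Rows can be read the same way while still inside the frame (assumed row XPath).
table_rows = driver.find_elements_by_xpath('//*[@id="cdc-chart-1-data"]/tbody/tr')
table_rows_text = [r.get_attribute('textContent').strip() for r in table_rows]
print(table_rows_text)

# Switch back to the top-level document before touching anything outside the iframe.
driver.switch_to.default_content()
driver.quit()

As Dmitry notes, each embedded chart on that page sits in its own iframe, so scraping a different table means switching to the corresponding frame first.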