Questions tagged [screen-scraping]

Screen-scraping, also known as web-scraping or data-scraping, is a software technique used to collect and parse information from user interfaces. If your question is specifically about scraping from websites or web-APIs, please use the [web-scraping] tag instead.

Screen-scraping, also known as web-scraping or data-scraping, is a software technique used to collect and parse information from websites. The information is scraped via a parser, for example using regular expressions or, in the case of a 3270 emulator, variants of HLLAPI.

Questions that have this tag should be directly related to gathering information from websites through a parsing mechanism, such as regular expressions, or through a browser emulator such as PhantomJS. (Questions about screen-scraping with regular expressions should also carry the corresponding regular-expression tag.)
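As a rough illustration of the regular-expression approach mentioned above, here is a minimal sketch. The HTML fragment, the class names and the `extract_times` helper are all made up for the example; a real page would be fetched over HTTP first, and its markup would almost certainly differ.

```python
import re

# Hypothetical page fragment (an assumption for this sketch, not a real site).
SAMPLE_HTML = """
<table>
  <tr><td class="label">Sunrise</td><td class="time">06:42</td></tr>
  <tr><td class="label">Sunset</td><td class="time">19:15</td></tr>
</table>
"""

# Match a label cell immediately followed by a time cell (HH:MM).
ROW_PATTERN = re.compile(
    r'<td class="label">(?P<label>[^<]+)</td>\s*'
    r'<td class="time">(?P<time>\d{2}:\d{2})</td>'
)


def extract_times(html):
    """Return a {label: time} dict for every label/time cell pair found."""
    return {m.group("label"): m.group("time") for m in ROW_PATTERN.finditer(html)}


times = extract_times(SAMPLE_HTML)
```

A pattern like this is tied to the exact markup it was written against, so it tends to break when the page layout changes; a proper HTML parser is usually more robust for anything beyond small, stable fragments.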

Because information on web pages is usually organized as structured HTML, basic screen-scraping can be a simple task. In most cases, the goal is not only to parse the data on the web page but also to collect it, either by reproducing it on a different web page or by storing it in a database.
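The parse-then-store workflow described above could look roughly like the following sketch, which uses only the Python standard library. The sample markup, the `LinkCollector` class, the `scrape_and_store` function and the `links` table are illustrative assumptions, and the database is an in-memory SQLite instance.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical markup standing in for a downloaded page.
SAMPLE_HTML = '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>'


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from anchor tags that have an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an anchor with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def scrape_and_store(html, conn):
    """Parse the links out of html and insert them into the links table."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    conn.execute("CREATE TABLE IF NOT EXISTS links (href TEXT, title TEXT)")
    conn.executemany("INSERT INTO links VALUES (?, ?)", parser.links)
    conn.commit()
    return parser.links


conn = sqlite3.connect(":memory:")
stored = scrape_and_store(SAMPLE_HTML, conn)
rows = conn.execute("SELECT href, title FROM links ORDER BY href").fetchall()
```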

One of the most common causes of problems in web-scraping is that the web page as seen in a browser (using DOM inspection tools) may be very different from the HTML retrieved by the web-scraping tool from the same URL. For example, JavaScript code may augment or modify the contents of the page when it is loaded in a browser.
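The sketch below tries to make that mismatch concrete. The markup is a made-up page in which a script would fill in a price once a browser runs it; a plain HTML parser never executes the script, so it sees an empty element. `RAW_HTML`, `ElementText` and `static_text` are hypothetical names, and the depth tracking assumes the target element contains no void tags such as `<br>`.

```python
from html.parser import HTMLParser

# Hypothetical response body: the price only appears after the script runs.
RAW_HTML = """
<html><body>
  <div id="price"></div>
  <script>document.getElementById("price").textContent = "42.00";</script>
</body></html>
"""


class ElementText(HTMLParser):
    """Collect the text inside the first element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # nesting depth inside the target element
        self.found = False  # True once the target element has been seen
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif not self.found and dict(attrs).get("id") == self.target_id:
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def static_text(html, element_id):
    """Return the element's text as found in the raw HTML, with no script execution."""
    parser = ElementText(element_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


price = static_text(RAW_HTML, "price")
```

Here `price` comes back empty even though a browser would display a value, which is why pages like this are typically scraped with a tool that executes JavaScript (a browser emulator) or by calling the underlying data endpoint directly.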

Screen-scraping a website may violate that website's Terms of Use, although the enforceability of such terms is unclear. Many major website hosts can detect ongoing screen-scraping and may respond as if it were a denial-of-service attack.
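One common way to avoid resembling a flood of requests is to space them out. The following is only a sketch of that idea: the `Throttle` class and its parameters are hypothetical, and no particular delay is implied to satisfy any site's terms or detection thresholds.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        # clock and sleep are injectable so the behaviour can be checked without waiting.
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Sleep until min_interval seconds have passed since the previous call.

        Returns the number of seconds slept (0.0 if no wait was needed).
        """
        waited = 0.0
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last = self._clock()
        return waited
```

A scraper would call `wait()` immediately before each request, for example with `Throttle(2.0)` to leave at least two seconds between fetches.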

Historically, screen-scraping also described the technique of "scraping" data from, or entering data into, a 3270 emulator. This technique gained some popularity shortly after the advent of such emulators. The API that 3270 emulators implemented was known as HLLAPI (High Level Language Application Programming Interface); EHLLAPI (Enhanced HLLAPI) and WinHLLAPI came later. Application programs would "drive" the emulator, sending simulated keystrokes and function keys and then waiting for responses.

4194 questions
197 votes, 10 answers

Web scraping with Python

I'd like to grab daily sunrise/sunset times from a web site. Is it possible to scrape web content with Python? What are the modules used? Is there any tutorial available?
eozzy

166 votes, 10 answers

Can scrapy be used to scrape dynamic content from websites that are using AJAX?

I have recently been learning Python and am dipping my hand into building a web-scraper. It's nothing fancy at all; its only purpose is to get the data off of a betting website and have this data put into Excel. Most of the issues are solvable and…
Joseph

114 votes, 2 answers

What's the best way of scraping data from a website?

I need to extract contents from a website, but the application doesn’t provide any application programming interface or another mechanism to access that data programmatically. I found a useful third-party tool called Import.io that provides click…
0x1ad2

105 votes, 13 answers

PhantomJS failing to open HTTPS site

I'm using the following code based on loadspeed.js example to open up a https:// site which requires http server authentication as well. var page = require('webpage').create(), system = require('system'), t, address; page.settings.userName =…
Sreerag

84 votes, 7 answers

How does a site like kayak.com aggregate content?

Greetings, I've been toying with an idea for a new project and was wondering if anyone has any idea on how a service like Kayak.com is able to aggregate data from so many sources so quickly and accurately. More specifically, do you think Kayak.com…
Jeff

81 votes, 5 answers

How I can get web page's content and save it into the string variable

How I can get the content of the web page using ASP.NET? I need to write a program to get the HTML of a webpage and store it into a string variable.
kamiar3001

71 votes, 8 answers

Executing Javascript from Python

I have HTML webpages that I am crawling using xpath. The etree.tostring of a certain node gives me this string: