Questions tagged [web-scraping]

Web scraping is the process of extracting specific information from websites that do not readily provide an API or other methods of automated data retrieval. Questions about "How To Get Started With Scraping" (e.g. with Excel VBA) should be *thoroughly researched* as numerous functional code samples are available. Web scraping methods include 3rd-party applications, development of custom software, or even manual data collection in a standardized way.

Web scraping (also known as web harvesting, web mining or web data extraction) is the act of using programming to extract information from the web.

Web scraping works by requesting HTML pages from a website and extracting specific data by taking advantage of patterns in the HTML markup, or by embedding a fully-fledged web browser. More advanced web scraping systems, particularly with regard to scale, scheduling, and automation, are often referred to as spiders or web crawlers.
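
A minimal sketch of the request-and-parse approach, assuming the third-party requests and beautifulsoup4 packages; the URL and CSS selectors are hypothetical:

    import requests
    from bs4 import BeautifulSoup

    response = requests.get("https://example.com/products", timeout=10)
    response.raise_for_status()  # fail early on HTTP errors

    soup = BeautifulSoup(response.text, "html.parser")
    for item in soup.select("div.product"):  # a pattern in the HTML markup
        name = item.select_one("h2").get_text(strip=True)
        price = item.select_one("span.price").get_text(strip=True)
        print(name, price)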

Potential uses include:

  • Retrieving product or stock prices for comparison,

  • Contact scraping and collecting email addresses,

  • Site mashup or building an alternative front-end for an existing site,

  • Collection of real-estate pricing or auto sales statistics,

  • Website change detection (see the sketch after this list),

  • Building archives of dead pages
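
The website-change-detection use above, for instance, can be as simple as hashing the fetched page and comparing it against the previous run; a rough sketch, with a hypothetical URL and state file:

    import hashlib
    import pathlib
    import requests

    url = "https://example.com/page-to-watch"   # hypothetical page
    state = pathlib.Path("last_hash.txt")       # stores the previous digest

    body = requests.get(url, timeout=10).content
    digest = hashlib.sha256(body).hexdigest()

    previous = state.read_text().strip() if state.exists() else None
    if digest != previous:
        print("page changed since last check")
        state.write_text(digest)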

The practice of web scraping has drawn considerable controversy because the terms of use or copyrights of some websites and electronic publications do not allow certain kinds of data mining. While web scraping is not illegal in itself, legal issues can arise if it is done with malicious or plagiaristic intent, to circumvent a site's purchasing system or subscription fees, or for other fraudulent purposes.

There have been numerous lawsuits and other legal actions against companies and individuals. Before attempting to extract information from a website in a way that is potentially contrary to the site's intended usage, it is important to exercise due diligence: educate yourself about applicable local and international laws as well as the site's terms of service, copyrights, and trademarks. Further discussion of the legal implications can be found online, including on Wikipedia, Hacker News and Laws.com.

Web crawling is the component of web scraping that operates across multiple sites, indexing information on the web using a bot or "spider". It is a universal technique adopted by most search engines, which honor exclusion requests such as those published in a robots.txt file placed on the site.
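
A crawler that honors robots.txt can check permissions with the standard library's urllib.robotparser; a small sketch with a hypothetical site and user-agent string:

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://example.com/robots.txt")  # hypothetical site
    rp.read()  # fetch and parse the robots.txt file

    if rp.can_fetch("MyCrawler/1.0", "https://example.com/some/page"):
        print("allowed to fetch")
    else:
        print("disallowed by robots.txt")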

In contrast, web scraping focuses more on the transformation of unstructured data on the web, typically from HTML into a structured form that can be more easily stored, manipulated, and analyzed using tools such as a database or spreadsheet.
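
For example, turning the rows of an HTML table into a CSV file; a sketch assuming requests and beautifulsoup4, with a hypothetical URL and table layout:

    import csv
    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://example.com/listings", timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    with open("listings.csv", "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["address", "price"])              # hypothetical columns
        for row in soup.select("table#listings tr")[1:]:   # skip the header row
            cells = [td.get_text(strip=True) for td in row.find_all("td")]
            if cells:
                writer.writerow(cells[:2])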

Screen scraping has a similar purpose but involves the programmatic collection of visual data from a source (as opposed to parsing data as in web scraping); it originally involved reading a terminal's memory or video output by connecting the terminal to another computer's input port.



A note on spelling

The verb is spelled to scrape (present participle scraping). It is not to be confused with to scrap (scrapping), which means to discard something you no longer want or need, or to abandon a plan.



49536 questions
646 votes · 19 answers

How to find elements by class

I'm having trouble parsing HTML elements with "class" attribute using Beautifulsoup. The code looks like this soup = BeautifulSoup(sdata) mydivs = soup.findAll('div') for div in mydivs: if (div["class"] == "stylelistrow"): print div I…
asked by Neo

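For reference, a minimal sketch of the usual approach in current BeautifulSoup versions, where the class_ keyword avoids the clash with Python's class keyword (the sample HTML is made up):

    from bs4 import BeautifulSoup

    sdata = '<div class="stylelistrow">first</div><div class="other">second</div>'
    soup = BeautifulSoup(sdata, "html.parser")

    # class_ (with the trailing underscore) filters on the CSS class
    for div in soup.find_all("div", class_="stylelistrow"):
        print(div.get_text())
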
378 votes · 3 answers

Headless Browser and scraping - solutions

I'm trying to put list of possible solutions for browser automatic tests suits and headless browser platforms capable of scraping. BROWSER TESTING / SCRAPING: Selenium - polyglot flagship in browser automation, bindings for Python, Ruby, …
asked by Inoperable

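As a small illustration of the headless-browser route, a sketch using Selenium 4 with headless Chrome (assumes the selenium package and a local Chrome install; the URL is hypothetical):

    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")   # run Chrome without a window

    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://example.com")
        print(driver.title)
    finally:
        driver.quit()
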
336 votes · 26 answers

How do I prevent site scraping?

I have a fairly large music website with a large artist database. I've been noticing other music sites scraping our site's data (I enter dummy Artist names here and there and then do google searches for them). How can I prevent screen scraping? …
asked by pixel

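Rate limiting per client is one common mitigation; a toy in-memory sketch (the thresholds and client_ip source are hypothetical, and a production site would normally do this in the web server or a framework middleware):

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 60        # hypothetical threshold
    MAX_REQUESTS = 100         # per client per window
    _history = defaultdict(deque)

    def allow_request(client_ip: str) -> bool:
        """Return False once a client exceeds the per-window request budget."""
        now = time.time()
        window = _history[client_ip]
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()               # drop timestamps outside the window
        if len(window) >= MAX_REQUESTS:
            return False                   # likely automated traffic; throttle
        window.append(now)
        return True
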
275 votes · 18 answers

How can I scrape a page with dynamic content (created by JavaScript) in Python?

I'm trying to develop a simple web scraper. I want to extract plain text without HTML markup. My code works on plain (static) HTML, but not when content is generated by JavaScript embedded in the page. In particular, when I use…
asked by mocopera

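One widely used approach is to let a real browser execute the JavaScript and then parse the rendered HTML; a sketch assuming selenium and beautifulsoup4, with a hypothetical URL and element id:

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://example.com/dynamic")   # hypothetical URL
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "content")))  # hypothetical id
        soup = BeautifulSoup(driver.page_source, "html.parser")
        print(soup.get_text(" ", strip=True))       # plain text, no markup
    finally:
        driver.quit()
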
274 votes · 26 answers

Scraping: SSL: CERTIFICATE_VERIFY_FAILED error for http://en.wikipedia.org

I'm practicing the code from 'Web Scraping with Python', and I keep having this certificate problem: from urllib.request import urlopen from bs4 import BeautifulSoup import re pages = set() def getLinks(pageUrl): global pages html =…
asked by Catherine4j

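A sketch of one common fix: pass an up-to-date CA bundle to urlopen instead of disabling verification (assumes the certifi package; the URL mirrors the question's Wikipedia example):

    import ssl
    import certifi
    from urllib.request import urlopen

    # hand urlopen a current CA bundle rather than turning verification off
    context = ssl.create_default_context(cafile=certifi.where())
    html = urlopen("https://en.wikipedia.org/wiki/Main_Page", context=context).read()
    print(len(html))
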
269 votes · 7 answers

How can I pass variable into an evaluate function?

I'm trying to pass a variable into a page.evaluate() function in Puppeteer, but when I use the following very simplified example, the variable evalVar is undefined. I can't find any examples to build on, so I need help passing that variable into the…
asked by Cat Burston

264 votes · 6 answers

How can I get the Google cache age of any URL or web page?

In my project I need the Google cache age to be added as important information. I tried to search sources for the Google cache age, that is, the number of days since Google last re-indexed the page listed. Where can I get the Google cache age?
asked by Tokendra Kumar Sahu

208 votes · 3 answers

How can I efficiently parse HTML with Java?

I do a lot of HTML parsing in my line of work. Up until now, I was using the HtmlUnit headless browser for parsing and browser automation. Now, I want to separate both the tasks. I want to use a light HTML parser because it takes much time in…
asked by Amit

203 votes · 18 answers

How to save an image locally using Python whose URL address I already know?

I know the URL of an image on Internet. e.g. http://www.digimouth.com/news/media/2011/09/google-logo.jpg, which contains the logo of Google. Now, how can I download this image using Python without actually opening the URL in a browser and saving the…
asked by Pankaj Vatsa

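A minimal sketch using the requests package (the URL is the one quoted in the question; the output filename is arbitrary):

    import requests

    url = "http://www.digimouth.com/news/media/2011/09/google-logo.jpg"
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    with open("google-logo.jpg", "wb") as fh:   # arbitrary local filename
        fh.write(resp.content)
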
197 votes · 10 answers

Web scraping with Python

I'd like to grab daily sunrise/sunset times from a web site. Is it possible to scrape web content with Python? what are the modules used? Is there any tutorial available?
asked by eozzy

185 votes · 9 answers

How to use Python requests to fake a browser visit a.k.a and generate User Agent?

I want to get the content from this website. If I use a browser like Firefox or Chrome I could get the real website page I want, but if I use the Python requests package (or wget command) to get it, it returns a totally different HTML page. I…
asked by user1726366

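A sketch of the usual first step, sending a browser-like User-Agent header with requests (the header string is only an example; some sites vary content on other signals as well):

    import requests

    headers = {
        "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/124.0 Safari/537.36")
    }
    resp = requests.get("https://example.com", headers=headers, timeout=10)
    print(resp.status_code, len(resp.text))
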
183 votes · 16 answers

retrieve links from web page using python and BeautifulSoup

How can I retrieve the links of a webpage and copy the url address of the links using Python?
asked by NepUS

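A minimal sketch with requests and BeautifulSoup, resolving relative links to absolute URLs (the start URL is hypothetical):

    from urllib.parse import urljoin
    import requests
    from bs4 import BeautifulSoup

    base = "https://example.com/"               # hypothetical start page
    soup = BeautifulSoup(requests.get(base, timeout=10).text, "html.parser")
    for a in soup.find_all("a", href=True):     # only anchors with an href
        print(urljoin(base, a["href"]))         # resolve relative links
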
181 votes · 12 answers

Problem HTTP error 403 in Python 3 Web Scraping

I was trying to scrape a website for practice, but I kept on getting the HTTP Error 403 (does it think I'm a bot)? Here is my code: #import requests import urllib.request from bs4 import BeautifulSoup #from urllib import urlopen import re webpage =…
asked by Josh

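Many 403s of this kind go away once a User-Agent header is sent; a urllib-only sketch with a hypothetical URL:

    from urllib.request import Request, urlopen

    req = Request("https://example.com/page",            # hypothetical URL
                  headers={"User-Agent": "Mozilla/5.0"}) # minimal browser-like UA
    html = urlopen(req).read()
    print(len(html))
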
162 votes · 4 answers

Scraping html tables into R data frames using the XML package

How do I scrape html tables using the XML package? Take, for example, this wikipedia page on the Brazilian soccer team. I would like to read it in R and get the "list of all matches Brazil have played against FIFA recognised teams" table as a…
asked by Eduardo Leoni

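The question itself is about R's XML package; for comparison, a Python analogue using pandas.read_html, which returns every table on a page as a DataFrame (assumes pandas plus lxml are installed):

    import pandas as pd

    # read_html returns a list with one DataFrame per <table> on the page
    tables = pd.read_html("https://en.wikipedia.org/wiki/Brazil_national_football_team")
    print(len(tables), "tables found")
    print(tables[0].head())
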
158 votes · 10 answers

can we use XPath with BeautifulSoup?

I am using BeautifulSoup to scrape an URL and I had the following code, to find the td tag whose class is 'empformbody': import urllib import urllib2 from BeautifulSoup import BeautifulSoup url = …
asked by Shiva Krishna Bavandla

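BeautifulSoup itself does not support XPath; a common alternative is lxml, sketched here with the question's td class and a hypothetical URL (assumes the lxml and requests packages):

    import requests
    from lxml import html

    page = requests.get("https://example.com", timeout=10)   # hypothetical URL
    tree = html.fromstring(page.content)

    # XPath query for the td elements whose class is 'empformbody'
    for td in tree.xpath('//td[@class="empformbody"]'):
        print(td.text_content())
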