How to get the link to all the pages of a website for data scraping

Question

I have been working on a program that scrapes data from a particular page of a website using a regular expression in PHP:

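     <?php
     // Set a User-Agent, since some servers reject requests without one.
     ini_set("user_agent", "PHP");
     $url = "http://www.example.com/page.html";
     $output = file_get_contents($url);
     // Non-greedy (.*?) stops at the first closing </h1> instead of the last one.
     if ($output !== false && preg_match('#<h1 class="title" itemprop="name">(.*?)</h1>#', $output, $match)) {
         echo $match[1] . "<br>";
     }
     ?>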

How do I write a program that collects all the existing links on the website, so that I can scrape the data from each of them? Opening every link in the browser and entering it manually would be worse than typing the data by hand instead of scraping it.

I know JavaScript, Python, and PHP, and can work in any of these three languages.

It's time to use [`DOMDocument`](http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php) — Kevin, Apr 28 '16 at 23:58

score 0 · Answer 1 · answered Apr 29 '16 at 00:02

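     import urllib2  # Python 2 code (urllib2, print statement)
     import bs4

     target_url = "http://www.example.com/page.html"  # page to scan; URL taken from the question
     for link in bs4.BeautifulSoup(urllib2.urlopen(target_url).read(), "html.parser").find_all("a"):
         print link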
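
The snippet above lists the links on a single page. To cover a whole site, the same step can be repeated for every discovered link that stays on the same host. Below is a minimal Python 3 sketch of that idea, not a definitive implementation. It uses only the standard library, so `html.parser` stands in for BeautifulSoup. The names `LinkParser`, `extract_links`, `fetch`, and `crawl`, and the `max_pages` limit, are assumptions made for illustration and are not part of the original answer.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag fed to it."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return absolute URLs (fragments removed) for all links in html."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.hrefs]


def fetch(url):
    """Download a page and decode it as UTF-8, replacing undecodable bytes."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=50):
    """Breadth-first crawl restricted to start_url's host.

    Returns the URLs that were fetched successfully, in visiting order.
    max_pages is a hypothetical safety limit so the crawl always terminates.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except OSError:
            continue  # skip pages that fail to download
        visited.append(url)
        for link in extract_links(html, url):
            # Follow only links on the same host that have not been queued yet.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Calling `crawl("http://www.example.com/")` would return the list of page URLs, and the title extraction from the question could then be run on each one. A real crawler should probably also respect `robots.txt` and pause between requests.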

Joran Beasley
