I'm scraping data on cryptographers for a school research project. I have a really simple class that fetches a web page, follows each of that page's href links, and writes the contents of each linked page to a file.
I'm not getting a specific error when I run the code, but right now it just writes a blank file. My issue seems to be that my getters and setters have no knowledge of my private instance variables, and furthermore, my object ($obj) seems to have no knowledge of my getters and setters, so I'm a bit confused.
I'm using JetBrains PhpStorm. Thanks to everyone for the help and support.
Edit: I've updated the code below and it now runs fine. For anyone interested in using it: this code scrapes all of the links off a web page and stores the contents of each linked page in a file. I'm probably going to end up modifying this to strip out all HTML so that I only get raw text, and then JSON-encode the output so that it can be easily parsed.
<?php
class Scraper
{
/*
=============================================
SET UP THE BASE DIRECTORY FOR SCRAPING,
AND OPEN FILES TO WRITE INFORMATION TO
==============================================
*/
private $basedir; //BASE DIRECTORY PATH FOR SCRAPING
private $outfile; //OPEN FILE HANDLE TO WRITE TO (SET VIA fopen(), NOT A FILE NAME)
/*
=============================================
SETTER FOR BASE DIRECTORY
==============================================
*/
public function setBaseDirectory($base)
{
$this->basedir = $base;
}
/*
=============================================
SETTER FOR OUTFILE
==============================================
*/
public function setOutfile($file)
{
$this->outfile = $file;
}
/*
=============================================
GETTER FOR OUTFILE
==============================================
*/
public function getOutfile()
{
return $this->outfile;
}
/*
=============================================
GETTER FOR BASE DIRECTORY
==============================================
*/
public function getBaseDirectory()
{
return $this->basedir;
}
/*
=============================================
THIS FUNCTION TAKES THE HYPERLINKS OUT OF
A WEB PAGE AND RETURNS THEM IN AN ARRAY.
ITS SCOPE IS PRIVATE SINCE IT IS A HELPER
METHOD FOR GETDIRCONTENTS
=============================================
*/
private function grabLinks($contents)
{
//$match[0] HOLDS THE FULL 'href="..."' MATCHES, $match[1] HOLDS ONLY THE
//CAPTURED URLS, SO NO MANUAL STRIPPING OF 'href=' OR QUOTES IS NEEDED
preg_match_all("|href=[\"'](.*?)[\"']|", $contents, $match);
return $match[1];
}
/*
=============================================
THE GOAL OF THIS FUNCTION IS TO GET THE
CONTENTS OF EACH FORUM POST AND WRITE THEM
INTO A FILE. MAY EXPLORE CREATING AN
ASSOCIATIVE ARRAY AND JSON_ENCODING THEM
BASED ON NAME = POST NAME VALUE = FILE CONTENTS
=============================================
*/
public function getDirContents($dir)
{
$contents = file_get_contents($dir);
if ($contents === false)
{
return; //INDEX PAGE COULD NOT BE FETCHED
}
$debug = fopen("debugLog.txt", "w"); //OPEN ONCE SO EACH PASS DOES NOT TRUNCATE THE LOG
foreach ($this->grabLinks($contents) as $link)
{
$contents = file_get_contents($dir.$link); //GET CONTENTS OF FILE FROM LINK
if ($contents === false)
{
continue; //SKIP LINKS THAT CANNOT BE FETCHED
}
$contents = strip_tags($contents);
fwrite($this->getOutfile(), $contents);
fwrite($debug, "debug contents: \n\n".$this->getBaseDirectory().$link." $contents \n\n");
}
fclose($debug);
}
}
/*
=============================================
CREATE NEW INSTANCE OF CLASS AND CALL FUNCTION
TO GET CONTENTS OF DIRECTORY ITEMS
==============================================
*/
$obj = new Scraper();
$obj->setBaseDirectory("http://satoshi.nakamotoinstitute.org/posts/");
$obj->setOutfile(fopen("Satoshi_Forum_Posts.txt", "w"));
$obj->getDirContents($obj->getBaseDirectory());
echo $obj->getBaseDirectory();
fclose($obj->getOutfile()); //CLOSE THE HANDLE; ECHOING IT WOULD ONLY PRINT "Resource id #N"
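
The JSON-encoding idea mentioned in the edit above could be sketched roughly as follows. This is only a minimal sketch, not part of the Scraper class: the function name encodePostsAsJson and the sample input are hypothetical, and it assumes the pages have already been fetched into an array keyed by link.

```php
<?php
// Hypothetical helper (not in the original class): takes an array of
// link => raw HTML and returns a JSON object of link => plain text.
function encodePostsAsJson(array $htmlByLink)
{
    $textByLink = array();
    foreach ($htmlByLink as $link => $html) {
        // strip_tags() removes the markup; trim() drops surrounding whitespace
        $textByLink[$link] = trim(strip_tags($html));
    }
    // Pretty-print and keep slashes unescaped so the link keys stay readable
    return json_encode($textByLink, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES);
}

// Inline sample data (an assumption; real input would come from file_get_contents)
$sample = array(
    "posts/example/" => "<p>Hello <b>world</b></p>",
);
echo encodePostsAsJson($sample), "\n";
```

Keying the array by link keeps each post's text tied to its source, so the output can be read back with json_decode($json, true) and looked up by link.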