I'm scraping data on cryptographers for a school research project. I have a really simple class that goes to a webpage, follows each of that page's href links, and writes the contents of each linked page to a file.

I'm not getting a specific error when I run the code, but right now it just writes a blank file. My getters and setters seem to have no knowledge of my private instance variables, and my object ($obj) seems to have no knowledge of my getters and setters, so I'm a bit confused.

I'm using JetBrains PHPStorm. Thanks to everyone for the help and support

Edit: I've updated the code below and it now runs fine. For anyone interested in using it: this code scrapes all of the links off a web page and stores the contents of each linked page in a file. I'm probably going to end up modifying it to strip out all HTML so that I only get raw text, and then JSON-encode the output so that it can be easily parsed.

<?php
class Scraper
{

    /*
    =============================================
    SET UP THE BASE DIRECTORY FOR SCRAPING,
    AND OPEN FILES TO WRITE INFORMATION TO
    ==============================================
    */

    private $basedir; //BASE DIRECTORY PATH FOR SCRAPING
    private $outfile; //NAME OF FILE TO WRITE TO

    /*
    =============================================
    SETTER FOR BASE DIRECTORY
    ==============================================
    */

    public function setBaseDirectory($base)
    {
        $this->basedir = $base;
    }

    /*
    =============================================
    SETTER FOR OUTFILE
    ==============================================
    */

    public function setOutfile($file)
    {
        $this->outfile = $file;
    }

    /*
    =============================================
    GETTER FOR OUTFILE
    ==============================================
    */

    public function getOutfile()
    {
        return $this->outfile;
    }

    /*
    =============================================
    GETTER FOR BASE DIRECTORY
    ==============================================
    */

    public function getBaseDirectory()
    {
        return $this->basedir;
    }


    /*
    =============================================
    THIS FUNCTION TAKES THE HYPERLINKS OUT OF
    A WEB PAGE AND RETURNS THEM IN AN ARRAY.
    ITS SCOPE IS PRIVATE SINCE IT IS A HELPER
    METHOD FOR GETDIRCONTENTS
    =============================================
    */
    private function grabLinks($contents)
    {

        // $match[1] HOLDS THE CAPTURED URLS WITHOUT THE href="..." WRAPPER,
        // SO NO FURTHER STRING CLEANUP IS NEEDED
        preg_match_all("|href=[\"'](.*?)[\"']|", $contents, $match);

        return $match[1];
    }


    /*
    =============================================
    THE GOAL OF THIS FUNCTION IS TO GET THE
    CONTENTS OF EACH FORUM POST AND WRITE THEM
    INTO A FILE. MAY EXPLORE CREATING AN
    ASSOCIATIVE ARRAY AND JSON_ENCODING THEM
    BASED ON NAME = POST NAME VALUE = FILE CONTENTS
    =============================================
    */
    public function getDirContents($dir)
    {

        $contents = file_get_contents($dir);
        $linksArray = $this->grabLinks($contents);
        $debug = fopen("debugLog.txt", "w"); //OPEN THE DEBUG LOG ONCE, NOT ON EVERY ITERATION
        foreach ($linksArray as $link)
        {
            $contents = strip_tags(file_get_contents($dir.$link)); //GET CONTENTS OF FILE FROM LINK
            fwrite($this->getOutfile(), $contents);
            fwrite($debug, "debug contents: \n\n".$this->getBaseDirectory()." $contents \n\n");
        }
        fclose($debug);
    }
}

/*
=============================================
CREATE NEW INSTANCE OF CLASS AND CALL FUNCTION
TO GET CONTENTS OF DIRECTORY ITEMS
==============================================
*/
$obj = new Scraper();
$obj->setBaseDirectory("http://satoshi.nakamotoinstitute.org/posts/");
$obj->setOutfile(fopen("Satoshi_Forum_Posts.txt", "w"));
$obj->getDirContents($obj->getBaseDirectory());
echo $obj->getBaseDirectory();
fclose($obj->getOutfile()); //CLOSE THE OUTPUT FILE HANDLE (ECHOING IT WOULD ONLY PRINT A RESOURCE ID)
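
The JSON-encoding idea mentioned in the edit note could look roughly like the sketch below. It is only an illustration: the `$pages` array is made-up sample data standing in for pages that would be fetched from each link, and the key names are assumptions rather than part of the class above.

```php
<?php
// Hypothetical sample data standing in for fetched pages, keyed by link.
$pages = [
    '1/' => '<html><body><p>First <b>post</b></p></body></html>',
    '2/' => '<html><body><p>Second post</p></body></html>',
];

$posts = [];
foreach ($pages as $name => $html) {
    // strip_tags() removes the markup; trim() drops surrounding whitespace.
    $posts[$name] = trim(strip_tags($html));
}

// Encode the name => text map as JSON so it can be parsed later.
$json = json_encode($posts, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES);
echo $json, "\n";
```

Decoding the result with `json_decode($json, true)` gives back the same associative array, which makes the output easier to consume than one long text file.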
Robert
  • 2
    What is the exact error from php? – Luceos Mar 11 '15 at 20:42
  • 2
    It's not really clear exactly what your error is, but I did notice this: `fwrite($debug, "debug contents: \n\n".$this->getOutfile().$this->getBaseDirectory." $contents \n\n");` You're missing the ()'s for the `getBaseDirectory` call. – Dan Smith Mar 11 '15 at 20:43
  • 1
    also, you are calling `getBaseDirectory` without passing in an argument when it requires one based on the definition (even though the argument is not used). – Jonathan Kuhn Mar 11 '15 at 20:49
  • Oops! Those are good catches and dumb mistakes on my part. The error I'm getting is that my getters and setters can't seem to see my private instance variables. Also, at the bottom when I call $obj->getBaseDirectory() my IDE tells me that there is no function called getBaseDirectory even though there obviously is. Thanks for the help! – Robert Mar 11 '15 at 21:05
  • 1
    Please provide "the exact error" like luceos requested. You paraphrasing the error doesn't help. Specifically the error will include the filename, the line number and what the problem is. Us searching for "the error I'm getting is that my..." won't give any results. Also, the error happening in your IDE sounds more like an issue with the IDE and not PHP itself. Have you tried just running the code? – Jonathan Kuhn Mar 11 '15 at 21:07
  • 1
    PHP does have some quirks, but it's best not to get into the habit of blaming a language, especially if you're new to it `:-)`. (Please also amend the question with the issues you've acknowledged - at the moment I can't throw this into codepad.org for testing, because of those hiccups). – halfer Mar 11 '15 at 21:08
  • 1
    (Aside: regular expressions for parsing HTML [tend to be discouraged](http://stackoverflow.com/a/1732454/472495). Use a proper HTML parser instead, like DOMDocument). – halfer Mar 11 '15 at 21:14
  • Ok sorry for not being more clear everyone and thanks for the help. I'm actually not getting a specific error - the code will actually run fine and write a blank file. I will try to rephrase the question to be more clear as requested. Also, for the record I actually find PHP to be a nice backend language for simple applications I just tend to prefer other languages for OOP. I'll continue to try to test things out to see if it really is my IDE and not PHP itself. – Robert Mar 11 '15 at 21:18
  • 2
    It's working on my machine, however you get a crap ton of 404 as this line `$contents = file_get_contents($dir.$linksArray[$i]);` tries to fetch URLs like `http://satoshi.nakamotoinstitute.org/posts/http://nakamotoinstitute.org/mempool/` – robbmj Mar 11 '15 at 21:38
  • 2
    You may not be getting fatal errors but the script generates many, many notices and warnings. – robbmj Mar 11 '15 at 21:39
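
Two of the suggestions in the comments above (use an HTML parser instead of a regex, and avoid prefixing the base URL onto links that are already absolute) can be sketched as follows. This is a minimal illustration, not a drop-in replacement: `extractLinks` and `resolveLink` are hypothetical helper names, only `<a>` tags are inspected, and the joiner is deliberately simplified (it does not handle root-relative links or `../` segments).

```php
<?php
// Hypothetical helper: extract href values with DOMDocument instead of a regex.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally so imperfect markup does not spam output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        if ($anchor->hasAttribute('href')) {
            $links[] = $anchor->getAttribute('href');
        }
    }
    return $links;
}

// Hypothetical helper: only prefix the base URL when the link is relative.
function resolveLink(string $base, string $link): string
{
    // Links that already carry a scheme (http://, https://) are used as-is.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $link)) {
        return $link;
    }
    return rtrim($base, '/') . '/' . ltrim($link, '/');
}

$html = '<a href="1/">First</a> <a href=\'http://nakamotoinstitute.org/mempool/\'>Mempool</a>';
$base = 'http://satoshi.nakamotoinstitute.org/posts/';
foreach (extractLinks($html) as $link) {
    echo resolveLink($base, $link), "\n";
}
```

With this approach the absolute mempool link is fetched as-is rather than being appended to the posts URL, which is what produced the 404s described above.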

1 Answer

OK, I've been able to locate the source of the problem, and I apologize for wasting the time of those who were kind enough to comment above. It turns out that my PHP code was essentially fine and ran after I made one change.

I had just started using the JetBrains PHPStorm IDE and had loaded this class file into the editor from my desktop rather than from the JetBrains workspace. Once I incorporated the small syntactical changes mentioned by Bulk and Jonathan Kuhn, I created a new project inside the workspace I defined when setting up the program, and all of the warning messages went away (I still don't fully understand why).

I ran my code and it produced the desired result. I'll post the updated code in the question, with the changes suggested in the comments, so that anyone who needs a script like this can use it. Thanks again to everyone who was willing to help out!

Robert
  • If it was just the `strip_contents` error (a function that does not exist), you may not have error reporting turned on in your environment. Try re-introducing the error, but have `error_reporting(-1);` at the top of the page (for non-live environments only). Then try again - you should see an error. – halfer Mar 11 '15 at 22:42
  • 1
    Hey halfer, thanks for the input. I actually put that in by mistake and realized it about 5 minutes after I posted the code. I have the update posting with "strip_tags()". Good catch though, I appreciate your help! – Robert Mar 12 '15 at 00:27
  • OK, no probs. Also, if this all now works, it's best not to overwrite the code in the question, as that will no longer exhibit the original problems and will make little sense for new readers. Even after you fix things, it's best to show what the issue was in an answer (i.e. repost the code) so the question and answer still makes sense. (I suspect this question will close as unanswerable anyway, since it is not clear what the fix is - so it doesn't matter too much in this case. But worth noting in general). – halfer Mar 12 '15 at 08:23
  • 1
    OK, thanks. In the future I will definitely do that. This thread is probably safe to close, since the problem was some weird issue with my IDE that I haven't figured out. – Robert Mar 12 '15 at 16:44
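
The error-reporting suggestion from the comments above can be tried with a small sketch like the one below. It assumes PHP 7 or later, where calling an undefined function throws a catchable `Error`; on older versions the same mistake is an uncatchable fatal error. The sample string is made up for illustration.

```php
<?php
// For non-live environments only: report every error, warning and notice.
error_reporting(-1);
ini_set('display_errors', '1');

$message = null;
try {
    // strip_contents() is not a PHP function; strip_tags() was the intended call.
    $text = strip_contents('<p>Hello</p>');
} catch (Error $e) {
    // PHP 7+ raises an Error for a call to an undefined function.
    $message = $e->getMessage();
    echo "Caught: ", $message, "\n";
}

// The corrected call removes the markup and keeps the text.
$text = strip_tags('<p>Hello</p>');
echo $text, "\n";
```

Running the script with error reporting enabled surfaces the undefined-function mistake immediately instead of leaving only a blank output file to go on.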