
I am trying to extract text from PDF files using the PdfParser library (`\Smalot\PdfParser`).

When I tried it with a couple of large, moderately complex PDF files, it gave me an error:

allowed memory size of 134217728 bytes

I need a permanent solution. Either the library is broken or my implementation is wrong.

Here is my code:

class crawlController extends Controller
{
  public function crawler()
  {
        $dirPath = '/home/development/pdf_root';
        $this->getFileFolderTree($dirPath);
  }

    public function getFileFolderTree($rootDirectory)
    {
        $this->goIntoFolder($rootDirectory);
    }

    public function goIntoFolder($dirPath)
    {
      // Get all the direct sub folders of the root folder
      try
      {
        $dirList = File::directories($dirPath);
      }
      catch(\App\Exceptions\InvalidArgumentException $e)
      {
        // Report the problem and stop: $dirList is undefined at this point
        echo $e->getMessage();
        return;
      }
      if(count($dirList) == 0)
      {
        //search for files now
        $this->searchFiles($dirPath);
      }
      else
      {
        // Loop through the list of directories
        foreach ($dirList as $dir) 
        {
          // Print name of the selected directory
          echo "Folder name : ",basename($dir)," Parent Folder :",basename($dirPath),"<br/>";

          // Recursively search the selected directory
          $this->goIntoFolder($dir); 
        }
        echo "<hr><br/>";
      }
    }

    public function searchFiles($dirPath)
    {
      // Read all files
      $files = File::files($dirPath);
      $result = FALSE;

      // If at least one file exists
      if(count($files) > 0)
      {
        foreach ($files as $file) 
        {
          // Check if file is a pdf file.
           if(0 == strcasecmp('pdf',File::extension($file)))
           {
              // Read the file
              $this->readFileData($file);
           }
        }        
        $result = TRUE;
      }
      return $result;
    }

    public function readFileData($file)
    {   
        // Build the PdfParser object and parse the file
        $parser = new \Smalot\PdfParser\Parser();
        $pdfLoad = $parser->parseFile($file);

        $content = $pdfLoad->getText();
        $txtFilename = basename($file).".txt";
        $bytesWritten = File::append($txtFilename,$content);
        if($bytesWritten)
        {
          echo "success : ",$file;
        }
        else
        {
          echo "Failure : ",$file;
        }
        unset($parser);
    }
}
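
To see how much memory each file actually costs, one option is to wrap the parsing call in a small measuring helper. This is only a sketch: `measureMemory` is a hypothetical helper (not part of PdfParser), and the `str_repeat` closure is a stand-in workload so the snippet runs without the library.

```php
<?php
// Hypothetical helper (not part of PdfParser): runs $work and reports how
// much memory it left allocated, plus the peak usage of the script so far.
function measureMemory(callable $work): array
{
    gc_collect_cycles();               // collect garbage cycles left by earlier work
    $before = memory_get_usage();
    $result = $work();
    $after  = memory_get_usage();      // measured while $result is still held

    return [
        'result'      => $result,
        'delta_bytes' => $after - $before,
        'peak_bytes'  => memory_get_peak_usage(),
    ];
}

// Stand-in workload; in the controller this would be the parseFile()/getText() call.
$stats = measureMemory(function () {
    return str_repeat('x', 1000000);
});

printf("delta: %d bytes, peak: %d bytes\n", $stats['delta_bytes'], $stats['peak_bytes']);
```

Inside `readFileData`, the closure could instead return `$parser->parseFile($file)->getText()`; comparing the reported figures with each file's size on disk would show which PDFs are expensive to parse.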
Kirk Beard
  • Thanks for the response. I have the following questions: 1. To what value should I increase it? 2. Is increasing the memory limit a permanent solution? – Ayush pratap Aug 05 '17 at 14:36
  • I don't know what you mean; you clearly already have to extend it beyond 128MB. If that is too much for this server, then maybe you could try parsing on another one. – Łukasz Zaroda Aug 05 '17 at 14:39
  • Possible duplicate of [Allowed memory size of 134217728 bytes exhausted](https://stackoverflow.com/questions/12264253/allowed-memory-size-of-134217728-bytes-exhausted) – Neobugu Aug 05 '17 at 14:41
  • OK, I understand that I have to increase it beyond 128MB, but my PDF files are only 3.5MB (the ones I tested with). So are we talking about RAM allocation? – Ayush pratap Aug 05 '17 at 14:41
  • @Rhopercy: It is similar, but I also want to find out why my code takes this much memory just to extract data from PDF files with a combined size of 3.5MB. My end goal is to extract data from PDF files with a combined size of ~500MB; in that case I don't know how much memory I am going to need. – Ayush pratap Aug 05 '17 at 14:47
  • You have `require $e->getMessage()`; maybe change it to `echo $e->getMessage();`? – Tpojka Aug 05 '17 at 18:42
  • @Tpojka: That was a typing mistake. Thanks for pointing it out. – Ayush pratap Aug 07 '17 at 06:54
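
The figure in the error message, 134217728 bytes, is PHP's `memory_limit` of 128M (128 × 1024 × 1024). A minimal sketch of raising it at runtime for this script only, assuming the host allows `ini_set` on `memory_limit`; `iniSizeToBytes` and `raiseMemoryLimit` are hypothetical helper names, and `512M` is an arbitrary illustrative value, not a recommendation:

```php
<?php
// Hypothetical helper: convert a php.ini shorthand size ("128M", "1G") to bytes.
function iniSizeToBytes(string $value): int
{
    $value  = trim($value);
    $number = (int) $value;                      // "128M" casts to 128
    switch (strtolower(substr($value, -1))) {
        case 'g': return $number * 1024 * 1024 * 1024;
        case 'm': return $number * 1024 * 1024;
        case 'k': return $number * 1024;
        default:  return $number;                // plain bytes, or -1 for "no limit"
    }
}

// Hypothetical helper: raise memory_limit only if the current limit is lower.
function raiseMemoryLimit(string $target): void
{
    $current = iniSizeToBytes(ini_get('memory_limit'));
    if ($current !== -1 && $current < iniSizeToBytes($target)) {
        ini_set('memory_limit', $target);
    }
}

echo iniSizeToBytes('128M'), "\n";   // the limit named in the error message
raiseMemoryLimit('512M');            // 512M is an arbitrary example value
```

Raising the limit only moves the ceiling. It does not by itself explain why parsing 3.5MB of PDF files uses more than 128MB.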

0 Answers