Questions tagged [phpcrawl]

PHPCrawl is a framework, written in PHP, for crawling/spidering websites.

28 questions
3
votes
1 answer

Are there any free PHP crawlers?

In the past I have used my own crawler, but now I need something more robust. Are there any good free PHP crawlers?
HappyDeveloper
  • 12,480
  • 22
  • 82
  • 117
2
votes
1 answer

PHP web crawler, data structure and storage: will it work with PHPCrawl?

If there are other classes written to do this, a link would be awesome. If not, how can I do it with PHPCrawl? Is it possible to store specific information from a crawled site based upon a set of rules specific to the site? Ex., [div.wantThis,…
Douglas
  • 1,238
  • 5
  • 15
  • 27
2
votes
1 answer

Scraping multiple single pages from (mostly) different domains with different structures

I have a list of very specific URLs that I need to scrape data from (different selectors/fields). There are a total of around 1000 links from around 300 different websites that have different structures (selector/XPath). I am trying to see if anyone has…
SorishK
  • 21
  • 2
2
votes
3 answers

Count the number of pages in a site

I'd like to know how many public pages there are in a site, for example smashingmagazine.com. Is there a way to count the number of pages?
Gaurav Sharma
  • 4,032
  • 14
  • 46
  • 72
2
votes
0 answers

PHPCrawl cookies / password authentication

I have a question about using PHPCrawl to crawl a password-protected site for which I have the password. I have a crawler that works for websites that do not need authentication. I run the crawler from the terminal (Ubuntu 14.04). But when I…
DShterio
  • 21
  • 4
2
votes
2 answers

PHPCrawl with simplehtmldom to parse data

I'm trying to use PHPCrawl to crawl and collect URLs, then feed them to simplehtmldom to pull the required data from the HTML and store it in a MySQL database. Right now I am getting the error: Fatal error: Call to undefined method…
partstaxi
  • 23
  • 1
  • 3
2
votes
4 answers

Has anyone worked with a PHP API to read 'Nutch search engine' crawl results?

I have set up 'Nutch search engine' to crawl websites. Now I need to write a PHP API to talk to the Nutch search engine. I need to do 2 things: using a PHP script, I need to tell Nutch which URLs to crawl (for this I have some pointers…
Annibigi
  • 5,895
  • 5
  • 23
  • 21
1
vote
1 answer

How do I use the setTmpFile() method of the phpcrawl class?

I am using this WebCrawler class: http://phpcrawl.cuab.de. It has a method named "setTmpFile()": http://phpcrawl.cuab.de/classreference.html#settmpfile. How can I use this method? Please suggest a good example.
Seek Php
  • 163
  • 2
  • 3
  • 12
1
vote
0 answers

Get certain data from pages using a crawler

I am looking to use a crawler to fetch data from a site. I found "How do I make a simple crawler in PHP?" and it was helpful, but I am looking to use the code on http://findpeopleonplus.com/ to get all the Google Plus links from the pages. I will…
ahoura
  • 689
  • 1
  • 6
  • 16
1
vote
0 answers

PHPCRAWL - How to add a filter for specific link names?

I'm using http://phpcrawl.cuab.de as a web crawler for one of my projects and it's working fine so far, except that I don't know how to exclude or skip links with a specific name. There are rules I already use to ignore specific file…
Oliver
  • 156
  • 1
  • 13
1
vote
1 answer

Using phpcrawl with Laravel 5.4

I am trying to use cuab's PHPCrawl within Laravel 5.4 and have included it through Composer using this package: https://packagist.org/packages/mmerian/phpcrawl. I have tried running this sample code: class MyCrawler extends PHPCrawler { function…
Kieran Headley
  • 647
  • 1
  • 7
  • 21
1
vote
1 answer

HTTPS host unreachable when crawling with PHPCrawler

When trying to crawl a website over the https protocol, PHPCrawler returns an error saying: Error connecting to https://www.something.com: Host unreachable (). However, it does crawl sites over the http:// protocol. My question is why this is happening, and…
JayKandari
  • 1,228
  • 4
  • 16
  • 33
1
vote
0 answers

PHP crawler: detect that a link causes a file download

I'm developing a PHP crawler and I can get the href of every link in a page. I don't want to save the URLs of file download links in my database, such as…
Manian Rezaee
  • 1,012
  • 12
  • 26
1
vote
1 answer

Optimize crawler script on cronjob

I have about 66 million domains in a MySQL table. I need to run a crawler on all the domains and update the row with count = 1 when the crawler has completed. The crawler script is written in PHP using the PHP crawler library. Here is the script: set_time_limit(10000); …
Wasif Khalil
  • 2,217
  • 9
  • 33
  • 58
1
vote
0 answers

How do I use PHPCrawl to retrieve specific data from a site?

I am using PHPCrawl on a website I would like to retrieve data from, but I do not know where to start with retrieving data from (e.g.) a span with a specific class. For example, I would like to retrieve the name "Jan" from this span:
BonifatiusK
  • 2,281
  • 5
  • 29
  • 43