
I'm wondering if there's some code or a library for getting all the URLs under a domain. I need to find every URL that belongs to a given domain.

For example, if my domain is https://stackoverflow.com/, I'd like to find all question URLs, like these:

  1. https://stackoverflow.com/questions/123/java-lib-or-app-to-convert-csv-to-xml-file
  2. https://stackoverflow.com/questions/456/what-can-i
  3. https://stackoverflow.com/questions/789/where-can-i

I don't know how many questions are under the domain, but I have to create an engine that finds all the URLs, and once the URLs are found, I need to insert the content of each page into my database.

I will create a small search engine for my 5 web pages.

Can anyone help please?

Thanks,

    This seems quite broad for a single question... you are writing a web crawler, which is complicated. Can you narrow your question to a specific technological issue, or are you hoping we will provide you with architecture for your program? – Chris Trahey Jul 07 '12 at 21:31
  • I will create it with PHP, but I don't know the name of this kind of job, so I don't know how to search for it on Google. How can I find examples of this kind of work? Actually, either would help. It is your choice whether to provide me an architecture or a way forward; I am okay with both. – user1508831 Jul 07 '12 at 21:37
  • Please elaborate on "I will create a small search engine for my 5 web pages." If you're crawling/scraping a site, why would you have 5 pages, or is this just an example number? – William Isted Jul 08 '12 at 00:50

1 Answer


Lucene allows you to easily index your pages so they can be searched efficiently and accurately.

See Zend_Search_Lucene for a PHP implementation of Lucene search.

You still have to spider your site and build the index, which is a separate problem. You could use software such as Teleport Pro to spider your site and produce a list of URLs, which you can then feed to a PHP script that fetches the contents of each page and passes them to Zend_Search_Lucene to build an index. You can also write the crawler in PHP or use an existing solution; a search for "php crawler" yields many results.
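To give a rough idea of the indexing step, a minimal sketch might look like the following. It assumes Zend Framework 1 is on your include path; the `urls.txt` file of crawled URLs is a hypothetical placeholder.

```php
<?php
// Minimal indexing sketch: assumes Zend Framework 1 is on the include
// path, and that urls.txt (a hypothetical placeholder) holds one
// crawled URL per line.
require_once 'Zend/Search/Lucene.php';

// Create a fresh index in ./search_index (use open() instead to append).
$index = Zend_Search_Lucene::create('./search_index');

foreach (file('urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $url) {
    $html = @file_get_contents($url);
    if ($html === false) {
        continue; // skip pages that could not be fetched
    }

    $doc = new Zend_Search_Lucene_Document();
    // Store the URL so search results can link back to the page.
    $doc->addField(Zend_Search_Lucene_Field::Text('url', $url));
    // Index the page text without storing it in the index itself.
    $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', strip_tags($html)));
    $index->addDocument($doc);
}

$index->commit();

// Searching the index later is then a one-liner:
// $hits = Zend_Search_Lucene::open('./search_index')->find('your query');
```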

drew010
  • Can I get all the URLs and insert the contents of the pages into the DB with the PHP crawler? – user1508831 Jul 07 '12 at 21:50
  • Sure, once you have a list of URLs you can get their contents using a function as simple as [file_get_contents()](http://php.net/file_get_contents). Inserting the full file into the DB for searching purposes isn't really ideal though. – drew010 Jul 07 '12 at 21:52
  • So I am going to search for a PHP crawler. Also, could anyone who has a sample please share it? – user1508831 Jul 07 '12 at 21:53
  • I found Sphider. It works really well and finds all the URLs. Can I make my own like Sphider? – user1508831 Jul 08 '12 at 14:58
  • Sure, you could make your own, but why, when you can use one of the many existing ones? Look at its source code and you will see it isn't trivial; you could spend days or weeks just making a spider that works well and handles edge cases. – drew010 Jul 08 '12 at 20:50
  • Actually, I am not a great programmer, and modifying someone else's code looks harder than writing my own. If I find a good resource that explains it, I think I can create mine. Also, Sphider indexes all words, but I need to index the whole content of each page, so modifying it is the same as creating a new one :) Can anyone help? (See the crawler sketch below.) – user1508831 Jul 08 '12 at 22:07
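For what it's worth, a bare-bones same-domain crawler in PHP could look something like the sketch below. The `pages` table and the PDO credentials are placeholders, and the relative-URL handling is deliberately crude; a real crawler would also need robots.txt handling, rate limiting, and proper URL normalization.

```php
<?php
// Bare-bones same-domain crawler sketch. The DB credentials and the
// `pages` table are placeholders; a real crawler also needs robots.txt
// handling, rate limiting, and proper URL normalization.
$start = 'https://stackoverflow.com/';
$host  = parse_url($start, PHP_URL_HOST);

$db     = new PDO('mysql:host=localhost;dbname=search', 'user', 'pass');
$insert = $db->prepare('INSERT INTO pages (url, content) VALUES (?, ?)');

$queue   = array($start);
$visited = array();

while (!empty($queue)) {
    $url = array_shift($queue);
    if (isset($visited[$url])) {
        continue; // already crawled
    }
    $visited[$url] = true;

    $html = @file_get_contents($url);
    if ($html === false) {
        continue; // skip pages that could not be fetched
    }

    // Store the page text so it can be indexed/searched later.
    $insert->execute(array($url, strip_tags($html)));

    // Extract links and queue only those on the same host.
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // suppress warnings from malformed HTML
    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if (parse_url($href, PHP_URL_HOST) === null) {
            // Crude relative-URL resolution; real code should do better.
            $href = rtrim($start, '/') . '/' . ltrim($href, '/');
        }
        if (parse_url($href, PHP_URL_HOST) === $host && !isset($visited[$href])) {
            $queue[] = $href;
        }
    }
}
```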