
There's a PHP based website that I'd like to replicate the data from.

The problem is that the website's data is only accessible via a company name search page - www.example.com/companynamesearch.php

The results are displayed under the same URL, so it does not have separate company name URLs to crawl for data.

Can anyone suggest an easy way to extract the data from the site?

Thanks

user2565123
  • There's a separate URL for each page. It's probably just buried in the JavaScript as an AJAX call. If you give us the real website we might be able to give specific help. Also, if you really need to pull data from an HTML page, consider something like YQL: https://developer.yahoo.com/yql/ – Jonathan M Jun 02 '14 at 22:20
  • You're correct, Brad. That's why I said probably. – Jonathan M Jun 02 '14 at 22:20
  • "replicate the data from" = steal? –  Jun 02 '14 at 22:21
  • If all you have is a search engine, you'd have to throw a few bajillion search terms at it and scrape the results. Scraping sites like that tends to be frowned on by the site's owners... in other words, you'd be better off contacting the operators and negotiating a feed. Otherwise you're just stealing. – Marc B Jun 02 '14 at 22:22
  • "Written permission is required to duplicate any of the content within this site." Those who can, do; those who can't, scrape other people's sites :( –  Jun 02 '14 at 22:29
  • @Marc B - Thanks Marc. As it's publicly available data, I'd be inclined to suggest it wouldn't be stealing in the criminal sense. But I'll take a look at your suggestions and see if there's a better way of going about this – user2565123 Jun 02 '14 at 22:30
  • @Dagon Ah yes, I didn't see that. Thanks – user2565123 Jun 02 '14 at 22:31
  • Yeah, @user2565123, it's stealing their data. You would do better to contact them and ask for it. Save the time, hassle and conscience. – Jonathan M Jun 02 '14 at 22:32
  • @JonathanM I will do, thanks. This is as much a learning exercise for me as anything else; I'm fairly new to PHP. So it's still interesting to read the replies. – user2565123 Jun 02 '14 at 22:43

3 Answers


First, you need to find out how the data is queried. Figure out whether the data is truly part of this page or whether it comes in via AJAX, as suggested by @JonathanM. You can use a tool like Fiddler or your browser's developer tools to monitor the requests the page makes.

If you find the data comes in via AJAX, you're all set: you can request that URL directly. The response is probably JSON, but it could be in any format, so watch for that.
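If the AJAX response does turn out to be JSON, decoding it in PHP is short. This is only a minimal sketch: the function name `parseSearchResponse` and the response shape (a top-level `results` array) are assumptions for illustration, so check the real response in your developer tools first.

```php
<?php
// Assumed (hypothetical) response shape:
//   {"results": [{"name": "Acme Ltd"}, ...]}
// Adjust the 'results' key to whatever the real endpoint returns.
function parseSearchResponse(string $body): array
{
    // Decode into associative arrays; throw JsonException on malformed JSON.
    $data = json_decode($body, true, 512, JSON_THROW_ON_ERROR);

    // Return an empty list if the expected key is missing or not a list.
    if (!is_array($data) || !isset($data['results']) || !is_array($data['results'])) {
        return [];
    }

    return $data['results'];
}
```

`JSON_THROW_ON_ERROR` needs PHP 7.3 or later; on older versions you would check `json_last_error()` instead.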

If the data is on this page and the page is queried by POST data, then you are going to have to make those POST requests and then parse the returned page. Don't parse the HTML by hand, though. Use DOMDocument to dig through the page for you. See this question for details: How do you parse and process HTML/XML in PHP?
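As a rough illustration of the DOMDocument approach, the sketch below pulls text out of a results table. The markup it expects (rows with `class="result"` and the company name in the first cell) and the function name `extractCompanyNames` are made up for the example; the real page will need its own XPath expression.

```php
<?php
// Hypothetical markup: each search result is a <tr class="result"> row
// whose first <td> holds the company name. Adjust the XPath for the real page.
function extractCompanyNames(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid, so collect libxml warnings
    // internally instead of printing them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $names = [];
    foreach ($xpath->query('//tr[@class="result"]/td[1]') as $cell) {
        $names[] = trim($cell->textContent);
    }

    return $names;
}
```

The function only takes the HTML string, so it works the same whether the page was fetched by a POST request or loaded from a saved file while you experiment.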

Brad

If your chosen language is PHP, you should look at cURL's automated form submission capabilities, which will let you submit the internal search engine's form from a script.

There is a useful Stack Overflow answer here: "fill out a form automaticly using curl and php".

Or you can look at these basic tutorials to get you started: http://phpsense.com/2007/php-curl-functions/ and http://devzone.zend.com/160/using-curl-and-libcurl-with-php/

Using cURL with PHP will save you plenty of time, but be warned: if the site's owners don't want you to scrape their site, you could be in for a tough time. And of course there are copyright issues to think about.
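The form submission described in this answer might look roughly like the following. Treat it as a sketch under assumptions: the helper names and the form field `companyname` are hypothetical, so inspect the real form to find the actual field names before using it.

```php
<?php
// Encode the form fields the way a browser would for a standard
// application/x-www-form-urlencoded POST.
function buildSearchPostBody(array $fields): string
{
    return http_build_query($fields);
}

// POST the fields to the search page and return the response body.
function submitSearchForm(string $url, array $fields): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => buildSearchPostBody($fields),
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects after the POST
        CURLOPT_TIMEOUT        => 30,    // seconds
    ]);

    $html = curl_exec($ch);
    if ($html === false) {
        throw new RuntimeException('Request failed: ' . curl_error($ch));
    }

    return $html;
}

// Example call ('companyname' is a guessed field name, not the site's real one):
// $html = submitSearchForm('http://www.example.com/companynamesearch.php',
//                          ['companyname' => 'Acme Ltd']);
```

If you run many searches in a loop, add a pause such as `sleep(1)` between requests so you don't hammer the server.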

Hektor
  • Glad to assist. Actually this is prolly the best link of all: http://www.catswhocode.com/blog/10-awesome-things-to-do-with-curl – Hektor Jun 02 '14 at 22:36

Have you tried searching Google for site:www.example.com? You may get a list of all the pages back.

They might have submitted a sitemap or Google might have found another way.

Joan-Diego Rodriguez